MDPI - Publisher of Open Access Journals

20 pages, 23952 KB

Open Access Article

Deepfake Speech Detection Using Perceptual Pathological Features Related to Timbral Attributes and Deep Learning

by Anuwat Chaiwongyen, Khalid Zaman, Kai Li, Suradej Duangpummet, Jessada Karnjana, Waree Kongprawechnon and Masashi Unoki

Appl. Sci. 2026, 16(4), 2077; https://doi.org/10.3390/app16042077 - 20 Feb 2026

Viewed by 797

Abstract

The detection of deepfake speech has become a significant research area due to rapid advancements in generative AI for speech synthesis. These technologies pose significant security risks in applications such as biometric authentication, voice-controlled systems, and automatic speaker verification (ASV) systems. Therefore, enhancing the detection capabilities of such applications is essential to mitigate potential threats. This study investigates perceptual speech-pathological features, which are commonly used to evaluate the unnaturalness of voice disorders in clinical settings, as potential indicators for detecting deepfake speech. Specifically, the timbral attributes of hardness, depth, brightness, roughness, sharpness, warmth, boominess, and reverberation are examined. The analysis reveals that these attributes provide meaningful distinctions between genuine and synthetic speech. Furthermore, the detection performance is enhanced by extending the dimensional representation of timbral attributes, enabling a more comprehensive characterization of the speech signal. To improve deepfake speech detection, this paper proposes a method that combines two models: one that feeds the different dimensions of speech-pathological features to a deep neural network (DNN), and another that pairs a gammatone filterbank model, which simulates the auditory processing mechanism of the human cochlea, with a ResNet-18 architecture. The proposed method is evaluated on the Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof) 2019 dataset. Experimental results demonstrate that the proposed approach outperforms baseline models in terms of Equal Error Rate (EER), achieving an EER of 5.93%.

(This article belongs to the Special Issue AI in Audio Analysis: Spectrogram-Based Recognition)
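For readers unfamiliar with the metric reported in the abstract above, the following is a minimal sketch of how an Equal Error Rate can be estimated from detector scores. The function name `equal_error_rate`, the convention that a higher score means "more likely genuine", and the toy scores are assumptions made for illustration. This is not the ASVspoof evaluation code, which may handle thresholds and interpolation differently.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Estimate the EER from two lists of detector scores.

    Assumes a higher score means "more likely genuine". The EER is the
    operating point where the false acceptance rate (spoofed speech
    accepted) equals the false rejection rate (genuine speech rejected).
    """
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    thresholds.append(float("inf"))  # a threshold that rejects everything
    best_gap, eer = None, None
    for t in thresholds:
        # Accept a sample when its score is at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Hypothetical toy scores: one spoofed sample scores above one genuine sample.
genuine = [0.9, 0.8, 0.7, 0.6]
spoof = [0.65, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine, spoof))  # 0.25
```

With only a handful of scores the two error curves rarely cross exactly, so the sketch reports the midpoint at the closest threshold; on large evaluation sets the difference is usually small.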


18 pages, 4265 KB

Open Access Article

Animal Species Classification from Vocalizations Using Cochlear-Inspired Audio Features and Machine Learning

by Karim Youssef, Julien Moussa H. Barakat, Ghina El Mir, Sherif Said, Samer Al Kork and Alaa Eleyan

Biomimetics 2025, 10(12), 830; https://doi.org/10.3390/biomimetics10120830 - 11 Dec 2025

Viewed by 1434

Abstract

Biomimetic approaches have gained increasing attention in the development of efficient computational models for sound scene analysis. In this paper, we present a sound-based animal species classification method inspired by the auditory processing mechanisms of the human cochlea. The approach employs gammatone filtering to extract features that capture the distinctive characteristics of animal vocalizations. While gammatone filterbanks themselves are well established in auditory signal processing, their systematic application and evaluation for animal vocalization classification represent the main contribution of this work. Four gammatone-based feature representations are explored and used to train and test an artificial neural network for species classification. The method is evaluated on a dataset comprising vocalizations from 13 animal species, with 50 vocalizations per species and an average duration of 2.76 seconds per vocalization. The evaluations study the system parameters under different conditions and system architectures. Although the dataset is limited in scale compared to larger public databases, the results highlight the potential of combining biomimetic cochlear filtering with machine learning to perform reliable and robust species classification through sound.

(This article belongs to the Special Issue Biomimicry for Optimization, Control, and Automation: 3rd Edition)
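As a rough illustration of the gammatone filtering mentioned in the abstract above, the sketch below builds a small filterbank and reduces a signal to one log energy per channel. It assumes the widely used Glasberg and Moore ERB formula, a fourth-order gammatone impulse response with bandwidth factor 1.019, and direct FIR convolution. The function names, the 32-channel layout, and the log-energy feature are illustrative choices, not the four feature representations evaluated in the paper.

```python
import numpy as np


def erb(f_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore formula)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)


def erb_spaced_frequencies(f_low, f_high, n_channels):
    """Centre frequencies equally spaced on the ERB-rate scale."""
    def to_erb_rate(f):
        return 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)

    e = np.linspace(to_erb_rate(f_low), to_erb_rate(f_high), n_channels)
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37


def gammatone_impulse_response(fc, fs, duration=0.05, order=4, b=1.019):
    """Truncated gammatone impulse response, normalised to unit peak."""
    t = np.arange(int(duration * fs)) / fs
    envelope = t ** (order - 1) * np.exp(-2.0 * np.pi * b * erb(fc) * t)
    g = envelope * np.cos(2.0 * np.pi * fc * t)
    return g / np.max(np.abs(g))


def gammatone_log_energies(x, fs, n_channels=32, f_low=50.0, f_high=None):
    """Filter x with each channel and return (centre freqs, log energies)."""
    f_high = fs / 2.0 if f_high is None else f_high
    cfs = erb_spaced_frequencies(f_low, f_high, n_channels)
    energies = [
        np.mean(np.convolve(x, gammatone_impulse_response(fc, fs), mode="same") ** 2)
        for fc in cfs
    ]
    return cfs, np.log(np.asarray(energies) + 1e-12)


# Hypothetical test signal: a pure tone placed on the centre of channel 14.
fs = 16000
t = np.arange(fs) / fs
tone_hz = erb_spaced_frequencies(50.0, fs / 2.0, 32)[14]
cfs, log_e = gammatone_log_energies(np.sin(2.0 * np.pi * tone_hz * t), fs)
print("strongest channel centre (Hz):", cfs[np.argmax(log_e)])
```

A per-channel energy vector like this could be fed to a small classifier; frame-wise energies (a cochleagram) would be the natural extension when temporal structure matters.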


16 pages, 7008 KB

Open Access Article

Improving Top-Down Attention Network in Speech Separation by Employing Hand-Crafted Filterbank and Parameter-Sharing Transformer

by Aye Nyein Aung and Jeih-weih Hung

Electronics 2024, 13(21), 4174; https://doi.org/10.3390/electronics13214174 - 24 Oct 2024

Viewed by 2057

Abstract

The “cocktail party problem”, the challenge of isolating individual speech signals from a noisy mixture, has traditionally been addressed using statistical methods. However, deep neural networks (DNNs), with their ability to learn complex patterns, have emerged as superior solutions. DNNs excel at capturing intricate relationships between mixed audio signals and their respective speech sources, enabling them to effectively separate overlapping speech signals in challenging acoustic environments. Recent advances in speech separation systems have drawn inspiration from the brain’s hierarchical sensory information processing, incorporating top-down attention mechanisms. The top-down attention network (TDANet) employs an encoder–decoder architecture with top-down attention to enhance feature modulation and separation performance. By leveraging attention signals from multi-scale input features, TDANet effectively modifies features across different scales using a global attention (GA) module in the encoder–decoder design. Local attention (LA) layers then convert these modulated signals into high-resolution auditory characteristics. In this study, we propose two key modifications to TDANet. First, we substitute the fully trainable convolutional encoder with a deterministic hand-crafted multi-phase gammatone filterbank (MP-GTF), which mimics human hearing. Experimental results demonstrated that this substitution yielded comparable or even slightly superior performance to the original TDANet with a trainable encoder. Second, we replace the single multi-head self-attention (MHSA) layer in the global attention module with a transformer encoder block consisting of multiple MHSA layers. To optimize GPU memory utilization, we introduce a parameter sharing mechanism, dubbed “Reverse Cycle”, across layers in the transformer-based encoder. Our experimental findings indicated that these proposed modifications enabled TDANet to achieve competitive separation performance, rivaling state-of-the-art techniques, while maintaining superior computational efficiency.

(This article belongs to the Special Issue Natural Language Processing Method: Deep Learning and Deep Semantics)


17 pages, 2235 KB

Open Access Article

Unsupervised Single-Channel Singing Voice Separation with Weighted Robust Principal Component Analysis Based on Gammatone Auditory Filterbank and Vocal Activity Detection

by Feng Li, Yujun Hu and Lingling Wang

Sensors 2023, 23(6), 3015; https://doi.org/10.3390/s23063015 - 10 Mar 2023

Cited by 3 | Viewed by 4978

Abstract

Singing-voice separation is a separation task that involves a singing voice and musical accompaniment. In this paper, we propose a novel, unsupervised methodology for extracting a singing voice from the background in a musical mixture. This method is a modification of robust principal component analysis (RPCA) that separates a singing voice by using weighting based on a gammatone filterbank and vocal activity detection. Although RPCA is a helpful method for separating voices from the music mixture, it fails when one single value, such as drums, is much larger than others (e.g., the accompanying instruments). As a result, the proposed approach takes advantage of varying values between low-rank (background) and sparse matrices (singing voice). Additionally, we propose an expanded RPCA on the cochleagram by utilizing coalescent masking on the gammatone. Finally, we utilize vocal activity detection to enhance the separation outcomes by eliminating the lingering music signal. Evaluation results reveal that the proposed approach provides better separation outcomes than RPCA on the ccMixter and DSD100 datasets.

(This article belongs to the Special Issue Audio Signal Processing for Sensing Technologies)
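To make the low-rank plus sparse decomposition in the abstract above concrete, here is a minimal sketch of plain RPCA solved by principal component pursuit with alternating updates of the low-rank part, the sparse part, and a multiplier. It is not the weighted variant the paper proposes: there is no gammatone weighting, cochleagram input, or vocal activity detection. The function names are hypothetical, the defaults for `lam` and `mu` follow commonly used settings, and the random test matrix stands in for a magnitude spectrogram.

```python
import numpy as np


def soft_threshold(X, tau):
    """Element-wise shrinkage toward zero by tau."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def singular_value_threshold(X, tau):
    """Shrink the singular values of X by tau (promotes low rank)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt


def rpca(M, lam=None, mu=None, n_iter=2000, tol=1e-7):
    """Split M into low-rank L and sparse S with M approximately L + S."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam
    mu = m * n / (4.0 * np.abs(M).sum()) if mu is None else mu
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)  # Lagrange multiplier for the constraint M = L + S
    norm_M = np.linalg.norm(M)
    for _ in range(n_iter):
        L = singular_value_threshold(M - S + Y / mu, 1.0 / mu)
        S = soft_threshold(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * norm_M:
            break
    return L, S


# Hypothetical stand-in data: a rank-2 "accompaniment" plus sparse "vocal" spikes.
rng = np.random.default_rng(0)
low_rank = rng.standard_normal((60, 2)) @ rng.standard_normal((2, 60))
sparse = np.zeros((60, 60))
mask = rng.random((60, 60)) < 0.05
sparse[mask] = 5.0 * np.sign(rng.standard_normal(mask.sum()))
L_hat, S_hat = rpca(low_rank + sparse)
print("relative residual:", np.linalg.norm(low_rank + sparse - L_hat - S_hat))
```

In a separation pipeline the sparse part would typically be turned into a time-frequency mask and applied to the mixture before inverting back to audio; that step is omitted here.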


21 pages, 2924 KB

Open Access Article

A Novel Pathological Voice Identification Technique through Simulated Cochlear Implant Processing Systems

by Rumana Islam, Esam Abdel-Raheem and Mohammed Tarique

Appl. Sci. 2022, 12(5), 2398; https://doi.org/10.3390/app12052398 - 25 Feb 2022

Cited by 21 | Viewed by 4514

Abstract

This paper presents a pathological voice identification system employing signal processing techniques through cochlear implant models. The fundamentals of the biological process for speech perception are investigated to develop this technique. Two cochlear implant models are considered in this work: one uses a conventional bank of bandpass filters, and the other one uses a bank of optimized gammatone filters. The critical center frequencies of those filters are selected to mimic the human cochlear vibration patterns caused by audio signals. The proposed system processes the speech samples and applies a CNN for final pathological voice identification. The results show that the two proposed models adopting bandpass and gammatone filterbanks can discriminate the pathological voices from healthy ones, resulting in F1 scores of 77.6% and 78.7%, respectively, with speech samples. The obtained results of this work are also compared with those of other related published works.

(This article belongs to the Topic Artificial Intelligence in Healthcare)


15 pages, 3076 KB

Open Access Article

A Design Method for Gammachirp Filterbank for Loudness Compensation in Hearing Aids

by Ruxue Guo, Ruiyu Liang, Qingyun Wang and Cairong Zou

Appl. Sci. 2022, 12(4), 1793; https://doi.org/10.3390/app12041793 - 9 Feb 2022

Cited by 4 | Viewed by 2884

Abstract

Because people with hearing impairment often experience different degrees of hearing loss across frequencies, the loudness compensation algorithm in hearing aids decomposes the speech signal and compensates each frequency band according to the user's audiogram. However, the speech quality of the compensated signal is unsatisfactory because traditional filterbanks fail to fully consider the characteristics of human hearing and personalized hearing loss. In this study, a design method for a gammachirp filterbank for the loudness compensation algorithm is proposed to improve the speech quality of hearing aids. First, a multichannel gammachirp filterbank is employed to decompose the signals. Then, adjacent bands are merged into one channel, guided by the proposed combination method. After the personalized filterbank is obtained, loudness compensation is applied in each band to match the requirements of the audiogram. A key advantage of the gammachirp filterbank is that it can simulate the characteristics of the basilar membrane. Furthermore, the novel channel combination method considers the information from the audiograms and the relationship between frequency ranges and speech intelligibility. The experimental results showed that the proposed multichannel gammachirp filterbank achieves better speech signal decomposition and synthesis, and that good performance can be obtained with fewer channels. The loudness compensation algorithm based on the gammachirp filterbank effectively improves sentence intelligibility. The sentence recognition rate of the proposed method is higher than that of a system with a gammatone filterbank by approximately 13%.

(This article belongs to the Section Acoustics and Vibrations)


16 pages, 769 KB

Open Access Article

Data-Dependent Feature Extraction Method Based on Non-Negative Matrix Factorization for Weakly Supervised Domestic Sound Event Detection

by Seokjin Lee, Minhan Kim, Seunghyeon Shin, Sooyoung Park and Youngho Jeong

Appl. Sci. 2021, 11(3), 1040; https://doi.org/10.3390/app11031040 - 24 Jan 2021

Cited by 7 | Viewed by 3481

Abstract

In this paper, feature extraction methods are developed based on the non-negative matrix factorization (NMF) algorithm to be applied in weakly supervised sound event detection. Recently, the development of various features and systems has been attempted to tackle the problems of acoustic scene classification and sound event detection. However, most of these systems use data-independent spectral features, e.g., Mel-spectrogram, log-Mel-spectrum, and gammatone filterbank. Some data-dependent feature extraction methods, including the NMF-based methods, recently demonstrated the potential to tackle the problems mentioned above for long-term acoustic signals. In this paper, we further develop the recently proposed NMF-based feature extraction method to enable its application in weakly supervised sound event detection. To achieve this goal, we develop a strategy for training the frequency basis matrix using a heterogeneous database consisting of strongly and weakly labeled data. Moreover, we develop a non-iterative version of the NMF-based feature extraction method so that it can be applied as a part of the model structure, similar to the modern “on-the-fly” transform method for the Mel-spectrogram. To detect the sound events, the temporal basis is calculated using the NMF method and then used as a feature for the mean-teacher-model-based classifier. The results are improved for the event-wise post-processing method. To evaluate the proposed system, simulations of the weakly supervised sound event detection were conducted using the Detection and Classification of Acoustic Scenes and Events 2020 Task 4 database. The results reveal that the proposed system has F1-score performance comparable with the Mel-spectrogram and gammatonegram and exhibits 3–5% better performance than the log-Mel-spectrum and constant-Q transform.

(This article belongs to the Special Issue Machine Learning Methods with Noisy, Incomplete or Small Datasets)
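The abstract above describes learning a frequency basis with NMF and then using the temporal activations as features. A minimal sketch of that idea, using the standard iterative multiplicative updates for the Frobenius-norm objective, is shown below. It does not reproduce the paper's non-iterative variant or its training strategy for strongly and weakly labeled data; the function name `nmf`, its parameters, and the random stand-in spectrograms are assumptions for illustration.

```python
import numpy as np


def nmf(V, rank=None, W=None, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (freq x frames) as W @ H.

    If W is given, it is treated as a fixed, pre-trained frequency basis
    and only the temporal activations H are estimated.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    learn_W = W is None
    if learn_W:
        W = rng.random((n_freq, rank)) + eps
    else:
        rank = W.shape[1]
    H = rng.random((rank, n_frames)) + eps
    losses = []
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        if learn_W:
            W *= (V @ H.T) / (W @ H @ H.T + eps)
        losses.append(float(np.linalg.norm(V - W @ H)))
    return W, H, losses


# Hypothetical stand-in for magnitude spectrograms (64 bins, 3 latent sources).
rng = np.random.default_rng(1)
true_basis = rng.random((64, 3))
V_train = true_basis @ rng.random((3, 200))
V_new = true_basis @ rng.random((3, 50))

W, H_train, losses = nmf(V_train, rank=3)   # learn the frequency basis
_, H_new, _ = nmf(V_new, W=W)               # activations of new audio as features
print(H_new.shape, losses[0], losses[-1])
```

Fixing `W` at inference time is what makes the activations comparable across recordings, since every clip is then described in terms of the same learned spectral templates.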


11 pages, 491 KB

Open Access Article

Robust Cochlear-Model-Based Speech Recognition

by Mladen Russo, Maja Stella, Marjan Sikora and Vesna Pekić

Computers 2019, 8(1), 5; https://doi.org/10.3390/computers8010005 - 1 Jan 2019

Cited by 15 | Viewed by 10014

Abstract

Accurate speech recognition can provide a natural interface for human–computer interaction. Recognition rates of modern speech recognition systems are highly dependent on background noise levels, and the choice of acoustic feature extraction method can have a significant impact on system performance. This paper presents a robust speech recognition system based on a front-end motivated by human cochlear processing of audio signals. In the proposed front-end, cochlear behavior is first emulated by the filtering operations of the gammatone filterbank and subsequently by the Inner Hair Cell (IHC) processing stage. In experiments with a continuous density Hidden Markov Model (HMM) recognizer, the proposed Gammatone Hair Cell (GHC) coefficients perform worse than the standard Mel-Frequency Cepstral Coefficients (MFCC) baseline in clean speech conditions but significantly better in noisy conditions.


Search Results (8)
