Audio and Acoustic Signal Processing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Acoustics and Vibrations".

Deadline for manuscript submissions: closed (28 February 2023) | Viewed by 44317

Special Issue Editors


Prof. Dr. Yoshinobu Kajikawa
Guest Editor
Department of Electrical, Electronic and Information Engineering, Faculty of Engineering Science, Kansai University, Osaka 564-8680, Japan
Interests: audio and acoustic signal processing; active noise control; sound reproduction

Prof. Dr. Cheng-Yuan Chang
Guest Editor
Department of Electrical Engineering, Chung Yuan Christian University, Chung Li, Taoyuan 32023, Taiwan
Interests: real-time digital signal processing; sensors and measurements; active noise control

Special Issue Information

Dear Colleagues,

Acoustic signal processing has various applications such as source separation, reverberation suppression, microphone array, echo cancellation, active noise control, speech enhancement, immersive sound, and sound field reproduction. It is also necessary to study the signal processing methods and algorithms to realize them. Furthermore, technologies related to electro-acoustic transducers such as loudspeakers and microphones are also essential. Further research in the field of acoustic signal processing, including recent efforts and new issues, has been conducted in recent years. In this Special Issue, we invite papers on the theory and application of signal processing techniques required in audio and acoustic systems.

Prof. Dr. Yoshinobu Kajikawa
Prof. Dr. Cheng-Yuan Chang
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • audio and acoustic signal processing
  • source separation
  • speech enhancement
  • acoustic echo cancellation
  • active noise control
  • de-reverberation
  • microphone array
  • immersive audio
  • 3D sound reproduction

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (10 papers)

Research

11 pages, 2779 KiB  
Article
Enhanced Multiple Speakers’ Separation and Identification for VOIP Applications Using Deep Learning
by Amira A. Mohamed, Amira Eltokhy and Abdelhalim A. Zekry
Appl. Sci. 2023, 13(7), 4261; https://doi.org/10.3390/app13074261 - 28 Mar 2023
Cited by 2 | Viewed by 2075
Abstract
Institutions have been adopting work/study-from-home programs since the pandemic began. They primarily utilise Voice over Internet Protocol (VoIP) software to perform online meetings. This research introduces a new method to enhance the VoIP call experience using deep learning. In this paper, integration between two existing techniques, Speaker Separation and Speaker Identification (SSI), is performed using deep learning methods with effective results as introduced by state-of-the-art research. This integration is applied to a VoIP system application. The voice signal is introduced to the speaker separation and identification system to be separated; then, the “main speaker voice” is identified and verified rather than any other human or non-human voices around the main speaker. Then, only this main speaker voice is sent over IP to continue the call process. Currently, the online call system depends on noise cancellation and call quality enhancement. However, this does not address multiple human voices over the call. Filters used in the call process only remove the noise and the interference (de-noising speech) from the speech signal. The presented system is tested with up to four mixed human voices. This system separates only the main speaker voice and processes it prior to the transmission over VoIP call. This paper illustrates the algorithm technologies integration using DNN, and voice signal processing advantages and challenges, in addition to the importance of computing power for real-time applications.
(This article belongs to the Special Issue Audio and Acoustic Signal Processing)

18 pages, 2427 KiB  
Article
A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition
by Zhongwen Tu, Bin Liu, Wei Zhao, Raoxin Yan and Yang Zou
Appl. Sci. 2023, 13(7), 4124; https://doi.org/10.3390/app13074124 - 24 Mar 2023
Cited by 8 | Viewed by 2583
Abstract
The Speech Emotion Recognition (SER) algorithm, which aims to analyze the expressed emotion from a speech, has always been an important topic in speech acoustic tasks. In recent years, the application of deep-learning methods has made great progress in SER. However, the small scale of the emotional speech dataset and the lack of effective emotional feature representation still limit the development of research. In this paper, a novel SER method, combining data augmentation, feature selection and feature fusion, is proposed. First, aiming at the problem that there are inadequate samples in the speech emotion dataset and the number of samples in each category is unbalanced, a speech data augmentation method, Mix-wav, is proposed which is applied to the audio of the same emotion category. Then, on the one hand, a Multi-Head Attention mechanism-based Convolutional Recurrent Neural Network (MHA-CRNN) model is proposed to further extract the spectrum vector from the Log-Mel spectrum. On the other hand, Light Gradient Boosting Machine (LightGBM) is used for feature set selection and feature dimensionality reduction in four emotion global feature sets, and more effective emotion statistical features are extracted for feature fusion with the previously extracted spectrum vector. Experiments are carried out on the public dataset Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Chinese Hierarchical Speech Emotion Dataset of Broadcasting (CHSE-DB). The experiments show that the proposed method achieves 66.44% and 93.47% of the unweighted average test accuracy, respectively. Our research shows that the global feature set after feature selection can supplement the features extracted by a single deep-learning model through feature fusion to achieve better classification accuracy.
(This article belongs to the Special Issue Audio and Acoustic Signal Processing)

15 pages, 11962 KiB  
Article
Speech Enhancement Based on Two-Stage Processing with Deep Neural Network for Laser Doppler Vibrometer
by Chengkai Cai, Kenta Iwai and Takanobu Nishiura
Appl. Sci. 2023, 13(3), 1958; https://doi.org/10.3390/app13031958 - 2 Feb 2023
Viewed by 1876
Abstract
The development of distant-talk measurement systems has been attracting attention since they can be applied to many situations such as security and disaster relief. One such system that uses a device called a laser Doppler vibrometer (LDV) to acquire sound by measuring an object’s vibration caused by the sound source has been proposed. Different from traditional microphones, an LDV can pick up the target sound from a distance even in a noisy environment. However, the acquired sounds are greatly distorted due to the object’s shape and frequency response. Due to the particularity of the degradation of observed speech, conventional methods cannot be effectively applied to LDVs. We propose two speech enhancement methods that are based on two-stage processing with deep neural networks for LDVs. With the first proposed method, the amplitude spectrum of the observed speech is first restored. The phase difference between the observed and clean speech is then estimated using the restored amplitude spectrum. With the other proposed method, the low-frequency components of the observed speech are first restored. The high-frequency components are then estimated by the restored low-frequency components. The evaluation results indicate that they improved the observed speech in sound quality, deterioration degree, and intelligibility.
(This article belongs to the Special Issue Audio and Acoustic Signal Processing)

12 pages, 524 KiB  
Article
Efficient Realization for Third-Order Volterra Filter Based on Singular Value Decomposition
by Yuya Nakahira, Kenta Iwai and Yoshinobu Kajikawa
Appl. Sci. 2022, 12(21), 10710; https://doi.org/10.3390/app122110710 - 22 Oct 2022
Viewed by 1606
Abstract
Nonlinear distortion in loudspeaker systems degrades sound quality and must be properly compensated for by linearization techniques. One technique to reduce nonlinear distortion is to use a Volterra Filter, which approximates the nonlinearity of the target loudspeaker using the Volterra series expansion. In general, the Volterra Filter is computationally very expensive, and the amount of computation needs to be reduced for real-time processing. In this paper, we propose an efficient implementation of the third-order Volterra filter based on singular value decomposition. The proposed method determines the necessary coefficients based on the symmetry of the third-order Volterra filter and applies singular value decomposition to them. In the filter structure consisting of singular values and their corresponding singular vector, the computational complexity of the third-order Volterra filter can be reduced by eliminating the part of the filter with small singular values. By focusing on the magnitude of the singular values, the proposed method can improve the computational efficiency of the third-order Volterra filter without decreasing its approximation accuracy. Simulation results show that the proposed method can improve the computational efficiency by 60% while maintaining the nonlinear distortion compensation performance of the micro-speaker for smartphones by about 8 dB.
(This article belongs to the Special Issue Audio and Acoustic Signal Processing)
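
The idea of discarding filter components with small singular values can be illustrated with a generic truncated singular value decomposition. The sketch below is not the authors' Volterra filter structure: the coefficient matrix, its size, the relative threshold, and the helper name `truncate_svd` are all assumptions chosen for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def truncate_svd(coeffs, rel_threshold):
    """Keep only the singular components whose singular value is at
    least rel_threshold times the largest singular value."""
    u, s, vt = np.linalg.svd(coeffs, full_matrices=False)
    keep = s >= rel_threshold * s[0]
    return u[:, keep], s[keep], vt[keep, :]

# Hypothetical coefficient matrix: rank 2 plus a tiny perturbation.
rng = np.random.default_rng(0)
a = rng.standard_normal((8, 2))
b = rng.standard_normal((2, 8))
coeffs = a @ b + 1e-6 * rng.standard_normal((8, 8))

u, s, vt = truncate_svd(coeffs, rel_threshold=1e-3)

# Rebuild the matrix from the retained components only.
approx = (u * s) @ vt
rel_error = np.linalg.norm(coeffs - approx) / np.linalg.norm(coeffs)
```

In this toy case only two of the eight components survive the threshold, so applying the factored form needs far fewer multiplications than the full matrix while the reconstruction error stays small.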

14 pages, 1564 KiB  
Article
Sound Source Localization Indoors Based on Two-Level Reference Points Matching
by Shuopeng Wang, Peng Yang and Hao Sun
Appl. Sci. 2022, 12(19), 9956; https://doi.org/10.3390/app12199956 - 3 Oct 2022
Cited by 2 | Viewed by 1388
Abstract
A dense sample point layout is the conventional approach to ensure the positioning accuracy for fingerprint-based sound source localization (SSL) indoors. However, matching against a large number of reference points (RPs) in the online phase may greatly reduce positioning efficiency. In response to this compelling problem, a two-level matching strategy is adopted to shrink the search scope for adjacent RPs. In the first-level matching process, two different methods are adopted to shrink the search scope of the online phase in a simple scene and a complex scene. According to the global range of high similarity between adjacent samples in a simple scene, a greedy search method is adopted for fast searching of the sub-database that contains the adjacent RPs. Simultaneously, in accordance with the specific local areas’ range of high similarity between adjacent samples in a complex scene, the clustering method is used for database partitioning, and the RPs search scope can be compressed by sub-database matching. Experimental results show that the two-level RPs matching strategy can effectively improve the RPs matching efficiency for the two different typical indoor scenes on the premise of ensuring the positioning accuracy.
(This article belongs to the Special Issue Audio and Acoustic Signal Processing)

16 pages, 7041 KiB  
Article
Signal Enhancement of Helicopter Rotor Aerodynamic Noise Based on Cyclic Wiener Filtering
by Chengfeng Wu, Chunhua Wei, Yong Wang and Yang Gao
Appl. Sci. 2022, 12(13), 6632; https://doi.org/10.3390/app12136632 - 30 Jun 2022
Cited by 1 | Viewed by 1637
Abstract
The research on helicopter rotor aerodynamic noise becomes imperative with the wide use of helicopters in civilian fields. In this study, a signal enhancement method based on cyclic Wiener filtering was proposed given the cyclostationarity of rotor aerodynamic noise. The noise was adaptively filtered out by performing a group of frequency shifts on the input signal. According to the characteristics of rotor aerodynamic noise, a detection function was constructed to realize the long-distance detection of helicopters. The flight data of the Robinson R44 helicopter was obtained through the field flight experiment and employed as the research object for analysis. The detection range of the Robinson R44 helicopter after cyclic Wiener filtering was increased from 4.114 km to 17.75 km, verifying the feasibility and effectiveness of the proposed method. The efficacy of the proposed detection method was demonstrated and compared in the far-field flight test measurements of the Robinson R44 helicopter.
(This article belongs to the Special Issue Audio and Acoustic Signal Processing)

17 pages, 3826 KiB  
Article
A Deep Learning Method for DOA Estimation with Covariance Matrices in Reverberant Environments
by Qinghua Huang and Weilun Fang
Appl. Sci. 2022, 12(9), 4278; https://doi.org/10.3390/app12094278 - 23 Apr 2022
Cited by 4 | Viewed by 1878
Abstract
Acoustic source localization in the spherical harmonic domain with reverberation has hitherto not been extensively investigated. Moreover, deep learning frameworks have been utilized to estimate the direction-of-arrival (DOA) with spherical microphone arrays under environments with reverberation and noise for low computational complexity and high accuracy. This paper proposes three different covariance matrices as the input features and two different learning strategies for the DOA task. There is a progressive relationship among the three covariance matrices. The second matrix can be obtained by processing the first matrix and it effectively filters out the effects of the microphone array and mode strength to some extent. The third matrix can be obtained by processing the second matrix and it further efficiently removes information irrelevant to location information. In terms of the strategies, the first strategy is a regular learning strategy, while the second strategy is to split the task into three parts to be performed in parallel. Experiments were conducted both on the simulated and real datasets to show that the proposed method has higher accuracy than the conventional methods and lower computational complexity. Thus, the proposed method can effectively resist reverberation and noise.
(This article belongs to the Special Issue Audio and Acoustic Signal Processing)

18 pages, 2571 KiB  
Article
3-D Sound Image Reproduction Method Based on Spherical Harmonic Expansion for 22.2 Multichannel Audio
by Kenta Iwai, Hiromu Suzuki and Takanobu Nishiura
Appl. Sci. 2022, 12(4), 1994; https://doi.org/10.3390/app12041994 - 14 Feb 2022
Cited by 1 | Viewed by 2083
Abstract
In this paper, we propose a three-dimensional (3-D) sound image reproduction method based on spherical harmonic (SH) expansion for 22.2 multichannel audio. 22.2 multichannel audio is a 3-D sound field reproduction system that has been developed for ultra-high definition television (UHDTV). This system can reproduce 3-D sound images by simultaneously driving 22 loudspeakers and two sub-woofers. To control the 3-D sound image, vector base amplitude panning (VBAP) is conventionally used. VBAP can control the direction of 3-D sound image by weighting the input signal and emitting it from three loudspeakers. However, VBAP cannot control the distance of the 3-D sound image because it calculates the weight by only considering the image’s direction. To solve this problem, we propose a novel 3-D sound image reconstruction method based on SH expansion. The proposed method can control both the direction and distance of the 3-D sound image by controlling the sound directivity on the basis of spherical harmonics (SHs) and mode matching. The directivity of the 3-D sound image is obtained in the SH domain. In addition, the distance of the 3-D sound image is represented by the mode strength. The signal obtained by the proposed method is then emitted from loudspeakers and the 3-D sound image can be reproduced accurately with consideration of not only the direction but also the distance. A number of experimental results show that the proposed method can control both the direction and distance of 3-D sound images.
(This article belongs to the Special Issue Audio and Acoustic Signal Processing)
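
The conventional VBAP weighting mentioned in the abstract above can be sketched in a few lines. This is a minimal illustration of textbook three-loudspeaker VBAP, not the SH-based method proposed in the paper; the function names `det3` and `vbap_gains` and the loudspeaker layout are assumptions made for the example. The gains depend only on the source direction, which is the limitation the abstract points out.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def vbap_gains(speakers, source):
    """Solve g1*l1 + g2*l2 + g3*l3 = source for the three gains
    (Cramer's rule), then normalise them to unit total power.

    speakers: three unit direction vectors (hypothetical layout)
    source:   unit direction vector of the desired sound image
    """
    # System matrix whose columns are the loudspeaker vectors.
    base = [[speakers[j][i] for j in range(3)] for i in range(3)]
    d = det3(base)
    if abs(d) < 1e-12:
        raise ValueError("loudspeaker directions are coplanar")
    gains = []
    for k in range(3):
        m = [row[:] for row in base]
        for i in range(3):
            m[i][k] = source[i]  # replace column k with the source vector
        gains.append(det3(m) / d)
    norm = math.sqrt(sum(g * g for g in gains))
    return [g / norm for g in gains]

# Hypothetical layout: loudspeakers on the x, y and z axes.
speakers = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
source = (math.sqrt(0.5), math.sqrt(0.5), 0.0)
gains = vbap_gains(speakers, source)
```

A source halfway between the first two loudspeakers receives equal gains on those two and none on the third; scaling the source vector would leave the normalised gains unchanged, so distance is not represented.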

Review

25 pages, 2693 KiB  
Review
Mouth Sounds: A Review of Acoustic Applications and Methodologies
by Norberto E. Naal-Ruiz, Erick A. Gonzalez-Rodriguez, Gustavo Navas-Reascos, Rebeca Romo-De Leon, Alejandro Solorio, Luz M. Alonso-Valerdi and David I. Ibarra-Zarate
Appl. Sci. 2023, 13(7), 4331; https://doi.org/10.3390/app13074331 - 29 Mar 2023
Cited by 2 | Viewed by 8335
Abstract
Mouth sounds serve several purposes, from the clinical diagnosis of diseases to emotional recognition. The following review aims to synthesize and discuss the different methods to apply, extract, analyze, and classify the acoustic features of mouth sounds. The most analyzed features were the zero-crossing rate, power/energy-based, and amplitude-based features in the time domain; and tonal-based, spectral-based, and cepstral features in the frequency domain. Regarding acoustic feature analysis, t-tests, variations of analysis of variance, and Pearson’s correlation tests were the most-used statistical tests for feature evaluation, while the support vector machine and Gaussian mixture models were the most-used machine learning methods for pattern recognition. Neural networks were employed according to data availability. The main applications of mouth sound research were physical and mental condition monitoring. Nonetheless, other applications, such as communication, were included in the review. Finally, the limitations of the studies are discussed, indicating the need for standard procedures for mouth sound acquisition and analysis.
(This article belongs to the Special Issue Audio and Acoustic Signal Processing)
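
As an illustration of the simplest time-domain feature named in the abstract above, the zero-crossing rate can be computed as follows. This is a minimal sketch under one common definition (the fraction of adjacent sample pairs whose signs differ, with zero treated as non-negative); the function name and the sample values are assumptions, and the reviewed studies may use other conventions.

```python
def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ.

    Zero is treated as non-negative, which is one convention among several.
    """
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1
        for prev, cur in zip(samples, samples[1:])
        if (prev >= 0) != (cur >= 0)
    )
    return crossings / (len(samples) - 1)

# Hypothetical frames: one sign change in three adjacent pairs, and none.
rate_mixed = zero_crossing_rate([1.0, 1.0, -1.0, -1.0])
rate_flat = zero_crossing_rate([1.0, 2.0, 3.0])
```

In practice the rate is usually computed per short frame of the recording so that it can be tracked over time alongside energy-based features.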

13 pages, 591 KiB  
Review
Overview of Voice Conversion Methods Based on Deep Learning
by Tomasz Walczyna and Zbigniew Piotrowski
Appl. Sci. 2023, 13(5), 3100; https://doi.org/10.3390/app13053100 - 28 Feb 2023
Cited by 16 | Viewed by 18945
Abstract
Voice conversion is a process where the essence of a speaker’s identity is seamlessly transferred to another speaker, all while preserving the content of their speech. This is accomplished using algorithms that blend speech processing techniques, such as speech analysis, speaker classification, and vocoding. The cutting-edge voice conversion technology is characterized by deep neural networks that effectively separate a speaker’s voice from their linguistic content. This article offers a comprehensive overview of the development status of this area of science based on the current state-of-the-art voice conversion methods.
(This article belongs to the Special Issue Audio and Acoustic Signal Processing)