Speech, Acoustics, Audio Signal Processing and Applications in Sensors

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Intelligent Sensors".

Deadline for manuscript submissions: closed (29 February 2020) | Viewed by 33625

Special Issue Editor


Prof. Dr. Joon-Hyuk Chang
Guest Editor
School of Electronic and Computer Engineering, Hanyang University, Seoul 133-791, Korea
Interests: deep learning; speech recognition; speaker verification; acoustic event detection; speech enhancement

Special Issue Information

Dear Colleagues,

Over the last few decades, speech, acoustic, and audio signal processing has undergone explosive growth, with myriad applications that have become an integral part of everyday life. At the same time, user demand has increased dramatically in a short time, pushing system performance to its limits in applications such as speech/speaker recognition, source localization, and audio event detection.

This Special Issue aims to present novel speech/acoustic signal processing methods in which signals are collected either from a specific arrangement of sensors or from a fusion of sensors of different types.

One of the simplest extensions is multi-channel speech/acoustic signal processing based on novel neural network architectures specifically designed to fully exploit the spatial information residing in the collection of signals; other approaches may incorporate various secondary sensors to boost the performance of conventional systems built on microphone signals.

Speech, acoustic, and audio signal processing applications using multiple sensors can now be found in AI speakers, robotics, surveillance cameras, automobiles, and many other fields.

This Special Issue also aims to cover the possibilities that arise when multi-microphone or multimodal sensors are combined with advanced artificial intelligence for high-level audio understanding and detection.

Prof. Dr. Joon-Hyuk Chang
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Microphone array speech/acoustic signal processing
  • Multi-sensor speech/acoustic signal processing
  • Deep learning-based speech/speaker recognition via the use of sensors
  • Source localization in 3D spaces
  • Multi-mic-based audio event detection/scene classification
  • Real-time speech and audio processing with multiple sensors
  • New deep neural network structures for handling multi-mic signals

Published Papers (8 papers)


Research

13 pages, 411 KiB  
Article
Joint Optimization of Deep Neural Network-Based Dereverberation and Beamforming for Sound Event Detection in Multi-Channel Environments
by Kyoungjin Noh and Joon-Hyuk Chang
Sensors 2020, 20(7), 1883; https://doi.org/10.3390/s20071883 - 28 Mar 2020
Cited by 11 | Viewed by 3418
Abstract
In this paper, we propose the joint optimization of deep neural network (DNN)-supported dereverberation and beamforming for convolutional recurrent neural network (CRNN)-based sound event detection (SED) in multi-channel environments. First, short-time Fourier transform (STFT) coefficients are calculated from multi-channel audio signals captured in noisy and reverberant environments and are then enhanced by DNN-supported weighted prediction error (WPE) dereverberation with estimated masks. Next, the STFT coefficients of the dereverberated multi-channel audio signals are passed to a DNN-supported minimum variance distortionless response (MVDR) beamformer, in which beamforming is carried out with the source and noise masks estimated by the DNN. The resulting enhanced single-channel STFT coefficients are fed to the CRNN-based SED system, and the three modules are jointly trained with a single loss function designed for SED. Furthermore, to ease the difficulty of training a deep learning model for SED caused by the imbalance in the amount of data for each class, the focal loss is used as the loss function. Experimental results show that jointly training the DNN-supported dereverberation and beamforming with the SED model under the supervision of the focal loss significantly improves performance in noisy and reverberant environments.
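
As background for the beamforming stage described above, the sketch below shows a conventional mask-based MVDR beamformer for a single frequency bin in plain NumPy. It is only a minimal illustration of the standard formulation, not the paper's jointly trained DNN pipeline; the function name, mask inputs, and regularization constants are assumptions.

```python
import numpy as np

def mask_based_mvdr(stft, speech_mask, noise_mask):
    """Minimal sketch of mask-based MVDR beamforming for one frequency bin.

    stft:        (channels, frames) complex STFT coefficients
    speech_mask: (frames,) speech presence mask in [0, 1]
    noise_mask:  (frames,) noise presence mask in [0, 1]
    Returns the (frames,) single-channel enhanced STFT coefficients.
    """
    # Mask-weighted spatial covariance matrices
    phi_s = (speech_mask * stft) @ stft.conj().T / np.maximum(speech_mask.sum(), 1e-8)
    phi_n = (noise_mask * stft) @ stft.conj().T / np.maximum(noise_mask.sum(), 1e-8)

    # Steering vector: principal eigenvector of the speech covariance
    _, eigvecs = np.linalg.eigh(phi_s)
    d = eigvecs[:, -1]

    # MVDR weights: w = Phi_n^{-1} d / (d^H Phi_n^{-1} d)
    num = np.linalg.solve(phi_n, d)
    w = num / (d.conj() @ num)

    return w.conj() @ stft  # (frames,)
```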

24 pages, 1703 KiB  
Article
End-to-End Automatic Pronunciation Error Detection Based on Improved Hybrid CTC/Attention Architecture
by Long Zhang, Ziping Zhao, Chunmei Ma, Linlin Shan, Huazhi Sun, Lifen Jiang, Shiwen Deng and Chang Gao
Sensors 2020, 20(7), 1809; https://doi.org/10.3390/s20071809 - 25 Mar 2020
Cited by 38 | Viewed by 5777
Abstract
Advanced automatic pronunciation error detection (APED) algorithms are usually based on state-of-the-art automatic speech recognition (ASR) techniques. With the development of deep learning technology, end-to-end ASR technology has gradually matured and achieved positive practical results, which provides a new opportunity to update APED algorithms. We first constructed an end-to-end ASR system based on the hybrid connectionist temporal classification and attention (CTC/attention) architecture. An adaptive parameter was used to enhance the complementarity of the connectionist temporal classification (CTC) model and the attention-based seq2seq model, further improving the performance of the ASR system. The improved ASR system was then used for the APED task in Mandarin, and good results were obtained. This new APED method makes forced alignment and segmentation unnecessary, and it does not require multiple complex models, such as an acoustic model or a language model. It is convenient and straightforward, and will be a suitable general solution for L1-independent computer-assisted pronunciation training (CAPT). Furthermore, we find that, with regard to accuracy metrics, our proposed system based on the improved hybrid CTC/attention architecture is close to the state-of-the-art ASR system based on the deep neural network–deep neural network (DNN–DNN) architecture, and has a stronger effect on the F-measure metrics, which are especially suitable for the requirements of the APED task.
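
For orientation, the hybrid CTC/attention objective referred to above is commonly implemented as an interpolation of the two losses over a shared encoder. The sketch below is a minimal PyTorch illustration; the fixed interpolation weight, blank index, and tensor layouts are assumptions and do not reflect the paper's adaptive parameter scheme.

```python
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                              att_logits, lam=0.3):
    """Minimal sketch of a hybrid CTC/attention training objective.

    ctc_log_probs: (T, batch, vocab) log-probabilities from the shared encoder
    att_logits:    (batch, L, vocab) decoder outputs aligned with the targets
    lam:           interpolation weight; the paper adapts this parameter,
                   here it is a fixed scalar for illustration.
    Padding and <eos> handling are omitted for brevity.
    """
    ctc_loss = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                          blank=0, zero_infinity=True)
    att_loss = F.cross_entropy(att_logits.transpose(1, 2), targets)
    return lam * ctc_loss + (1.0 - lam) * att_loss
```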

19 pages, 25111 KiB  
Article
Sound Source Distance Estimation Using Deep Learning: An Image Classification Approach
by Mariam Yiwere and Eun Joo Rhee
Sensors 2020, 20(1), 172; https://doi.org/10.3390/s20010172 - 27 Dec 2019
Cited by 19 | Viewed by 4979
Abstract
This paper presents a sound source distance estimation (SSDE) method using a convolutional recurrent neural network (CRNN). We approach the sound source distance estimation task as an image classification problem, and we aim to classify a given audio signal into one of three predefined distance classes—one meter, two meters, and three meters—irrespective of its orientation angle. For the purpose of training, we create a dataset by recording audio signals at the three different distances and three angles in different rooms. The CRNN is trained using time-frequency representations of the audio signals. Specifically, we transform the audio signals into log-scaled mel spectrograms, allowing the convolutional layers to extract the appropriate features required for the classification. When trained and tested with combined datasets from all rooms, the proposed model exhibits high classification accuracies; however, training and testing the model in separate rooms results in lower accuracies, indicating that further study is required to improve the method’s generalization ability. Our experimental results demonstrate that it is possible to estimate sound source distances in known environments by classification using the log-scaled mel spectrogram.
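
The log-scaled mel-spectrogram features used as the CRNN input can be computed along the following lines. This is a generic sketch using librosa; the sample rate, FFT size, hop length, and number of mel bands shown are illustrative assumptions rather than the authors' settings.

```python
import librosa

def log_mel_spectrogram(path, sr=16000, n_mels=64, n_fft=1024, hop_length=512):
    """Minimal sketch of log-scaled mel-spectrogram features treated as an 'image'."""
    audio, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel)  # (n_mels, frames)
```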

14 pages, 2140 KiB  
Article
Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model
by Rehan Ahmad, Syed Zubair, Hani Alquhayz and Allah Ditta
Sensors 2019, 19(23), 5163; https://doi.org/10.3390/s19235163 - 25 Nov 2019
Cited by 11 | Viewed by 5883
Abstract
Speaker diarization systems aim to find ‘who spoke when?’ in multi-speaker recordings. The datasets usually consist of meetings, TV/talk shows, telephone, and multi-party interaction recordings. In this paper, we propose a novel multimodal speaker diarization technique that finds the active speaker through an audio-visual synchronization model. A pre-trained audio-visual synchronization model is used to find the synchronization between a visible person and the respective audio. For that purpose, short video segments comprising face-only regions are acquired using a face detection technique and are then fed to the pre-trained model. This model is a two-stream network that matches audio frames with their respective visual input segments. On the basis of high-confidence video segments inferred by the model, the respective audio frames are used to train Gaussian mixture model (GMM)-based clusters. This method helps generate speaker-specific clusters with high probability. We tested our approach on a popular subset of the AMI meeting corpus consisting of 5.4 h of audio recordings and 5.8 h of a different set of multimodal recordings. A significant improvement is observed with the proposed method in terms of DER when compared to conventional and fully supervised audio-based speaker diarization. The results of the proposed technique are very close to those of complex state-of-the-art multimodal diarization systems, which shows the significance of this simple yet effective technique.
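
The GMM-based clustering stage mentioned above can be sketched roughly as follows, assuming MFCC-like frame features and omitting the audio-visual synchronization model that selects the high-confidence segments. Function names, the number of mixture components, and the covariance type are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_speaker_gmms(features_per_speaker, n_components=8):
    """Fit one GMM per speaker on the high-confidence frames
    selected by the audio-visual synchronization model (selection not shown).

    features_per_speaker: dict mapping speaker id -> (frames, dims) features
    """
    return {spk: GaussianMixture(n_components=n_components,
                                 covariance_type='diag').fit(feats)
            for spk, feats in features_per_speaker.items()}

def assign_frames(gmms, features):
    """Label each frame with the speaker whose GMM gives the highest log-likelihood."""
    speakers = list(gmms.keys())
    scores = np.stack([gmms[s].score_samples(features) for s in speakers])  # (spk, frames)
    return [speakers[i] for i in scores.argmax(axis=0)]
```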

17 pages, 1074 KiB  
Article
Adversarially Learned Total Variability Embedding for Speaker Recognition with Random Digit Strings
by Woo Hyun Kang and Nam Soo Kim
Sensors 2019, 19(21), 4709; https://doi.org/10.3390/s19214709 - 30 Oct 2019
Cited by 4 | Viewed by 2197
Abstract
In recent years, driven by the increasing demand for voice-based authentication systems, various studies have investigated methods for verifying users with a short randomized pass-phrase. In this paper, we propose a novel technique for extracting an i-vector-like feature based on an adversarially learned inference (ALI) model, which summarizes the variability within the Gaussian mixture model (GMM) distribution through a nonlinear process. Analogous to the previously proposed variational autoencoder (VAE)-based feature extractor, the proposed ALI-based model is trained to generate the GMM supervector according to the maximum likelihood criterion given the Baum–Welch statistics of the input utterance. However, to prevent the potential loss of information caused by the Kullback–Leibler (KL) divergence regularization adopted in VAE-based model training, the newly proposed ALI-based feature extractor exploits a joint discriminator to ensure that the generated latent variable and GMM supervector are more realistic. The proposed framework is compared with the conventional i-vector and VAE-based methods using the TIDIGITS dataset. Experimental results show that the proposed method can represent the uncertainty caused by short durations better than the VAE-based method. Furthermore, the proposed approach shows great performance when applied in association with the standard i-vector framework.
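
For reference, the adversarial part of such a model typically builds on the standard ALI minimax objective, in which a joint discriminator D scores (data, latent) pairs produced either by the encoder G_z or the decoder G_x. The expression below is the generic ALI objective, not the paper's exact loss, which additionally incorporates the maximum-likelihood criterion described in the abstract.

```latex
\min_{G_x,\,G_z}\;\max_{D}\;
\mathbb{E}_{x \sim q(x)}\big[\log D\big(x,\, G_z(x)\big)\big]
+ \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D\big(G_x(z),\, z\big)\big)\big]
```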

22 pages, 6941 KiB  
Article
Forensic Speaker Verification Using Ordinary Least Squares
by Thyago J. Machado, Jozue Vieira Filho and Mario A. de Oliveira
Sensors 2019, 19(20), 4385; https://doi.org/10.3390/s19204385 - 10 Oct 2019
Cited by 10 | Viewed by 3277
Abstract
In Brazil, the recognition of speakers for forensic purposes still relies on a subjectivity-based decision-making process through the analysis of results from untrustworthy techniques. Owing to the lack of a voice database, speaker verification is currently applied to samples specifically collected for confrontation. However, comparative speaker analysis of contested speech requires the collection of an excessive number of voice samples from a series of individuals, and the recognition system must indicate which of the pre-selected individuals is most compatible with the contested voice. Accordingly, this paper proposes a combination of linear predictive coding (LPC) and ordinary least squares (OLS) as a speaker verification tool for forensic analysis. The proposed recognition technique establishes the confidence and similarity upon which to base forensic reports, indicating verification of the speaker of the contested speech. This paper therefore contributes an accurate, quick, alternative method to help verify the speaker. After running seven different tests, this study preliminarily achieved a hit rate of 100% on a limited dataset (Brazilian Portuguese). Furthermore, the developed method extracts a larger number of formants, which are indispensable for statistical comparisons via OLS. The proposed framework is robust at certain levels of noise, for sentences with the suppression of word changes, and for audio of different quality or even with meaningful time differences.
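
The two building blocks named in the abstract, frame-wise LPC features and an ordinary least squares fit between recordings, can be sketched as follows. This is an illustrative approximation only: the LPC order, framing parameters, and the use of R² as a similarity score are assumptions, and the paper's formant extraction and confidence reporting are not shown.

```python
import numpy as np
import librosa

def lpc_features(audio, sr, order=12, frame_len=1024, hop=512):
    """Frame-wise LPC coefficients as speaker features (leading 1 dropped)."""
    frames = librosa.util.frame(audio, frame_length=frame_len, hop_length=hop)
    return np.stack([librosa.lpc(frames[:, i], order=order)[1:]
                     for i in range(frames.shape[1])])  # (frames, order)

def ols_similarity(feat_a, feat_b):
    """Fit feat_b ~ feat_a by ordinary least squares and return R^2 as a
    crude similarity score between two recordings (illustrative only)."""
    n = min(len(feat_a), len(feat_b))
    X = np.hstack([np.ones((n, 1)), feat_a[:n]])      # add intercept
    y = feat_b[:n]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - y.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot
```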

9 pages, 494 KiB  
Article
Dual Microphone Voice Activity Detection Based on Reliable Spatial Cues
by Soojoong Hwang, Yu Gwang Jin and Jong Won Shin
Sensors 2019, 19(14), 3056; https://doi.org/10.3390/s19143056 - 11 Jul 2019
Cited by 7 | Viewed by 3147
Abstract
Two main spatial cues that can be exploited for dual-microphone voice activity detection (VAD) are the interchannel time difference (ITD) and the interchannel level difference (ILD). While both ITD and ILD provide information on the location of audio sources, they may be impaired in different ways by background noise and reverberation and can therefore carry complementary information. Conventional approaches utilize statistics from all frequencies with fixed weights, although the information from some time–frequency bins may degrade VAD performance. In this letter, we propose a dual-microphone VAD scheme based on spatial cues from reliable frequency bins only, considering the sparsity of the speech signal in the time–frequency domain. The reliability of each time–frequency bin is determined by three conditions on signal energy, ILD, and ITD. ITD-based and ILD-based VADs and statistics are evaluated using the information from the selected frequency bins and are then combined to produce the final VAD result. Experimental results show that the proposed frequency-selective approach enhances VAD performance in realistic environments.
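
The two spatial cues discussed above can be computed per time-frequency bin roughly as follows. The sketch omits the reliability selection (the energy, ILD, and ITD conditions) and the final VAD combination, and the function and variable names are assumptions.

```python
import numpy as np

def spatial_cues(stft_l, stft_r, sr, n_fft):
    """Per-bin spatial cues for dual-mic VAD (minimal sketch).

    stft_l, stft_r: (freq_bins, frames) complex STFTs of the two channels.
    Returns ILD in dB and ITD in seconds for every time-frequency bin.
    """
    eps = 1e-12
    ild = 20.0 * np.log10((np.abs(stft_l) + eps) / (np.abs(stft_r) + eps))

    # Inter-channel phase difference mapped to a time delay per frequency bin
    ipd = np.angle(stft_l * np.conj(stft_r))                # radians
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)[:, None]     # (freq_bins, 1)
    itd = ipd / (2.0 * np.pi * np.maximum(freqs, eps))      # seconds

    return ild, itd
```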

20 pages, 583 KiB  
Article
Window-Based Constant Beamwidth Beamformer
by Tao Long, Israel Cohen, Baruch Berdugo, Yan Yang and Jingdong Chen
Sensors 2019, 19(9), 2091; https://doi.org/10.3390/s19092091 - 6 May 2019
Cited by 19 | Viewed by 3402
Abstract
Beamformers have been widely used to enhance signals from a desired direction and suppress noise and interfering signals from other directions. Constant beamwidth beamformers maintain a fixed beamwidth over a wide range of frequencies. Most existing approaches to designing constant beamwidth beamformers are based on optimization algorithms with high computational complexity and are often sensitive to microphone mismatches. Other existing methods are based on adjusting the number of sensors according to the frequency, which simplifies the design but cannot control the sidelobe level. Here, we propose a window-based technique to attain beamwidth constancy, in which standard window functions of different shapes are applied at different frequency bins as the real weighting coefficients of the microphones. Thereby, we not only keep the beamwidth constant but also control the sidelobe level. Simulation results show the advantages of our method compared with existing methods, including a lower sidelobe level, higher directivity factor, and higher white noise gain.
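
The idea of using frequency-dependent window functions as real microphone weights can be illustrated with the sketch below, which computes the beampattern of a uniform linear array for per-frequency tapers. The Kaiser-window rule used in the example is an arbitrary placeholder, not the paper's design procedure, and the array geometry and frequencies are assumptions.

```python
import numpy as np
from scipy.signal import windows

def beampattern(weights, freqs, mic_positions, angles_deg, c=343.0):
    """Beampattern of a linear array with per-frequency real weights (sketch).

    weights:       (n_freqs, n_mics) real tapers (e.g., window functions)
    freqs:         (n_freqs,) frequencies in Hz
    mic_positions: (n_mics,) positions along the array axis in meters
    angles_deg:    (n_angles,) arrival angles in degrees
    Returns |B| of shape (n_freqs, n_angles).
    """
    theta = np.deg2rad(angles_deg)
    # Path difference of a plane wave from angle theta at each microphone (meters)
    path_diff = np.outer(mic_positions, np.cos(theta))          # (n_mics, n_angles)
    phase = 2j * np.pi * freqs[:, None, None] * path_diff[None] / c
    return np.abs(np.einsum('fm,fma->fa', weights, np.exp(phase)))

# Illustrative only (not the paper's design rule): broaden the taper at high
# frequencies with a Kaiser window whose beta grows with frequency.
n_mics = 8
freqs = np.linspace(500, 4000, 8)
tapers = np.stack([windows.kaiser(n_mics, beta=0.002 * f) for f in freqs])
pattern = beampattern(tapers, freqs, 0.04 * np.arange(n_mics), np.arange(0, 181, 2))
```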
