Submit to Electronics Review for Electronics Propose a Special Issue

Journal Menu

Journal Browser

Recent Advances in Audio, Speech and Music Processing and Analysis

Print Special Issue Flyer
Special Issue Editors
Special Issue Information
Keywords
Benefits of Publishing in a Special Issue
Related Special Issue
Published Papers

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Circuit and Signal Processing".

Deadline for manuscript submissions: closed (15 January 2025) | Viewed by 24016

Share This Special Issue

Special Issue Editors

Dr. Athanasios Koutras

E-Mail Website
Guest Editor

Department of Electrical and Computer Engineering, University of the Peloponnese, 24100 Kalamata, Greece
Interests: digital sound processing and analyis; EEG/MEG brain signal analysis
Special Issues, Collections and Topics in MDPI journals

Dr. Chrisoula Alexandraki

E-Mail Website
Guest Editor

Department of Music Technology & Acoustics, Hellenic Mediterranean University, 74133 Rethymnon, Greece
Interests: networked music performance; machine musicianship; music information retrieval; musical acoustics
Special Issues, Collections and Topics in MDPI journals

Special Issue Information

Dear Colleagues,

Audio plays an important role in everyday life since it is incorporated in various applications from broadcasting and telecommunications to the entertainment, multimedia, and gaming industries. Although less popular than image processing technology, which has overwhelmed the industry in recent years, audio processing in academia is under vigorous research and technological development. The relevant research initiatives are involved with speech recognition, audio compression, noise canceling, speaker verification and identification, voice synthesis, and voice transcription systems, to name a few. Furthermore, with respect to music signals, research initiatives focus on music information retrieval for music streaming and recommendation, networked music making, teaching and performing, autonomous, semi-autonomous computer musicians, and many more. This Special Issue gives the opportunity to disseminate state of the art progress on emerging applications, algorithms, and systems related to audio, speech, and music processing and analysis.

Topics of interest include, but are not limited to:

Audio and speech analysis and recognition.
Deep learning for robust speech recognition systems.
Active noise cancelling systems.
Blind speech separation.
Robust speech recognition in multi-simultaneous speaker environments.
Room acoustics modeling.
Environmental sound recognition.
Music information retrieval.
Networked music performance systems.
Internet of Sounds technologies and applications.
Computer accompaniment and machine musicianship.
Digital music representations and collaborative music making.
Online music education technologies.
Computational approaches to musical acoustics.
Music generation using deep learning.

Dr. Athanasios Koutras
Dr. Chrisoula Alexandraki
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

sound analysis
sound processing
music information retrieval
audio analysis
audio recognition
music technology
computational music cognition

Benefits of Publishing in a Special Issue

Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Related Special Issue

Recent Advances in Audio, Speech and Music Processing and Analysis, 2nd Edition in Electronics (3 articles)

Published Papers (10 papers)

Download All Papers

Order results

Result details

Show export options Show export options

Select all

Export citation of selected articles as:

Research

Jump to: Review

17 pages, 2093 KB

Open AccessArticle

Investigation of Data Augmentation Techniques in Environmental Sound Recognition

by Anastasios Loukas Sarris, Nikolaos Vryzas, Lazaros Vrysis and Charalampos Dimoulas

Electronics 2024, 13(23), 4719; https://doi.org/10.3390/electronics13234719 - 28 Nov 2024

Cited by 1 | Viewed by 1725

Abstract

The majority of sound events that occur in everyday life, like those caused by animals or household devices, can be included in the environmental sound family. This audio category has not been researched as much as music or speech recognition. One main bottleneck in the design of environmental data-driven monitoring automation is the lack of sufficient data representing each of a wide range of categories. In the context of audio data, an important method to increase the available data is the process of the augmentation of existing datasets. In this study, some of the most widespread time domain data augmentation techniques are studied, along with their effects on the recognition of environmental sounds, through the UrbanSound8K dataset, which consists of ten classes. The confusion matrix and the metrics that can be calculated based on the matrix were used to examine the effect of the augmentation. Also, to address the difficulty that arises when large datasets are augmented, a web-based data augmentation application was created. To evaluate the performance of the data augmentation techniques, a convolutional neural network architecture trained on the original set was used. Moreover, four time domain augmentation techniques were used. Although the parameters of the techniques applied were chosen conservatively, they helped the model to better cluster the data, especially in the four classes in which confusion was high in the initial classification. Furthermore, a web application is presented in which the user can upload their own data and apply these data augmentation techniques to both the audio extract and its time frequency representation, the spectrogram. Full article

(This article belongs to the Special Issue Recent Advances in Audio, Speech and Music Processing and Analysis)

► Show Figures

Figure 1

24 pages, 12589 KB

Open AccessArticle

Two-Dimensional Direction-of-Arrival Estimation Using Direct Data Processing Approach in Directional Frequency Analysis and Recording (DIFAR) Sonobuoy

by Amirhossein Nemati, Bijan Zakeri and Amir Masoud Molaei

Electronics 2024, 13(15), 2931; https://doi.org/10.3390/electronics13152931 - 25 Jul 2024

Viewed by 2078

Abstract

Today, the common solutions for underwater source angle detection require manned vessels and towed arrays, which are associated with high costs, risks, and deployment difficulties. An alternative solution for such applications is represented by acoustic vector sensors (AVSs), which are compact, lightweight and moderate in cost, and which have promising performance in terms of the bearing discrimination in two or three dimensions. One of the most popular devices for passive monitoring in underwater surveillance systems that employ AVSs is the directional frequency analysis and recording (DIFAR) sonobuoy. In this paper, direct data-processing (DDP) algorithms are implemented to calculate the azimuth angle of underwater acoustic sources by using short-time Fourier transform (STFT) via the arctan method instead of using fast Fourier transform (FFT). These algorithms for bearing estimation use the ‘Azigram’ to plot the estimated bearing of a source. It is demonstrated that by knowing the active sound intensity of the sound field and applying the inverse tangent to its real part, this matrix can be obtained. Announcing the time and frequency of the source simultaneously is one of the main advantages of this method, enabling the detection of multiple sources concurrently. DDP can also provide more details about sources’ characteristics, such as the frequency of the source and the time of the source’s presence. Full article

(This article belongs to the Special Issue Recent Advances in Audio, Speech and Music Processing and Analysis)

► Show Figures

Figure 1

19 pages, 9045 KB

Open AccessArticle

Audio Pre-Processing and Beamforming Implementation on Embedded Systems

by Jian-Hong Wang, Phuong Thi Le, Shih-Jung Kuo, Tzu-Chiang Tai, Kuo-Chen Li, Shih-Lun Chen, Ze-Yu Wang, Tuan Pham, Yung-Hui Li and Jia-Ching Wang

Electronics 2024, 13(14), 2784; https://doi.org/10.3390/electronics13142784 - 15 Jul 2024

Cited by 3 | Viewed by 2605

Abstract

Since the invention of the microphone by Barina in 1876, there have been numerous applications of audio processing, such as phonographs, broadcasting stations, and public address systems, which merely capture and amplify sound and play it back. Nowadays, audio processing involves analysis and noise-filtering techniques. There are various methods for noise filtering, each employing unique algorithms, but they all require two or more microphones for signal processing and analysis. For instance, on mobile phones, two microphones located in different positions are utilized for active noise cancellation (one for primary audio capture and the other for capturing ambient noise). However, a drawback is that when the sound source is distant, it may lead to poor audio capture. To capture sound from distant sources, alternative methods, like blind signal separation and beamforming, are necessary. This paper proposes employing a beamforming algorithm with two microphones to enhance speech and implementing this algorithm on an embedded system. However, prior to beamforming, it is imperative to accurately detect the direction of the sound source to process and analyze the audio from that direction. Full article

(This article belongs to the Special Issue Recent Advances in Audio, Speech and Music Processing and Analysis)

► Show Figures

Figure 1

20 pages, 1281 KB

Open AccessFeature PaperArticle

A Feature-Reduction Scheme Based on a Two-Sample t-Test to Eliminate Useless Spectrogram Frequency Bands in Acoustic Event Detection Systems

by Vahid Hajihashemi, Abdorreza Alavi Gharahbagh, Narges Hajaboutalebi, Mohsen Zahraei, José J. M. Machado and João Manuel R. S. Tavares

Electronics 2024, 13(11), 2064; https://doi.org/10.3390/electronics13112064 - 25 May 2024

Cited by 2 | Viewed by 1740

Abstract

Acoustic event detection (AED) systems, combined with video surveillance systems, can enhance urban security and safety by automatically detecting incidents, supporting the smart city concept. AED systems mostly use mel spectrograms as a well-known effective acoustic feature. The spectrogram is a combination of frequency bands. A big challenge is that some of the spectrogram bands may be similar in different events and be useless in AED. Removing useless bands reduces the input feature dimension and is highly desirable. This article proposes a mathematical feature analysis method to identify and eliminate ineffective spectrogram bands and improve AED systems’ efficiency. The proposed approach uses a Student’s t-test to compare frequency bands of the spectrogram from different acoustic events. The similarity between each frequency band among events is calculated using a two-sample t-test, allowing the identification of distinct and similar frequency bands. Removing these bands accelerates the training speed of the used classifier by reducing the number of features, and also enhances the system’s accuracy and efficiency. Based on the obtained results, the proposed method reduces the spectrogram bands by 26.3%. The results showed an average difference of 7.77% in the Jaccard, 4.07% in the Dice, and 5.7% in the Hamming distance between selected bands using train and test datasets. These small values underscore the validity of the obtained results for the test dataset. Full article

(This article belongs to the Special Issue Recent Advances in Audio, Speech and Music Processing and Analysis)

► Show Figures

Figure 1

12 pages, 2278 KB

Open AccessArticle

Gated Cross-Attention for Universal Speaker Extraction: Toward Real-World Applications

by Yiru Zhang, Bijing Liu, Yong Yang and Qun Yang

Electronics 2024, 13(11), 2046; https://doi.org/10.3390/electronics13112046 - 24 May 2024

Cited by 1 | Viewed by 2045

Abstract

Current target-speaker extraction (TSE) models have achieved good performance in separating target speech from highly overlapped multi-talker speech. However, in real-world applications, multi-talker speech is often sparsely overlapped, and the target speaker may be absent from the speech mixture, making it difficult for the model to extract the desired speech in such situations. To optimize models for various scenarios, universal speaker extraction has been proposed. However, current models do not distinguish between the presence or absence of the target speaker, resulting in suboptimal performance. In this paper, we propose a gated cross-attention network for universal speaker extraction. In our model, the cross-attention mechanism learns the correlation between the target speaker and the speech to determine whether the target speaker is present. Based on this correlation, the gate mechanism enables the model to focus on extracting speech when the target is present and filter out features when the target is absent. Additionally, we propose a joint loss function to evaluate both the reconstructed target speech and silence. Experiments on the WSJ0-2mix-extr and LibriMix datasets show that our proposed method achieves superior performance over comparison approaches in terms of SI-SDR and WER. Full article

(This article belongs to the Special Issue Recent Advances in Audio, Speech and Music Processing and Analysis)

► Show Figures

Figure 1

12 pages, 2954 KB

Open AccessArticle

Audio Recognition of the Percussion Sounds Generated by a 3D Auto-Drum Machine System via Machine Learning

by Spyros Brezas, Alexandros Skoulakis, Maximos Kaliakatsos-Papakostas, Antonis Sarantis-Karamesinis, Yannis Orphanos, Michael Tatarakis, Nektarios A. Papadogiannis, Makis Bakarezos, Evaggelos Kaselouris and Vasilis Dimitriou

Electronics 2024, 13(9), 1787; https://doi.org/10.3390/electronics13091787 - 6 May 2024

Cited by 1 | Viewed by 1822

Abstract

A novel 3D auto-drum machine system for the generation and recording of percussion sounds is developed and presented. The capabilities of the machine, along with a calibration, sound production, and collection protocol are demonstrated. The sounds are generated by a drumstick at pre-defined positions and by known impact forces from the programmable 3D auto-drum machine. The generated percussion sounds are accompanied by the spatial excitation coordinates and the correspondent impact forces, allowing for large databases to be built, which are required by machine learning models. The recordings of the radiated sound by a microphone are analyzed using a pre-trained deep learning model, evaluating the consistency of the physical sample generation method. The results demonstrate the ability to perform regression and classification tasks when fine tuning the deep learning model with the gathered data. The produced databases can properly train machine learning models, aiding in the investigation of alternative and cost-effective materials and geometries with relevant sound characteristics and in the development of accurate vibroacoustic numerical models for studying percussion instruments sound synthesis. Full article

(This article belongs to the Special Issue Recent Advances in Audio, Speech and Music Processing and Analysis)

► Show Figures

Figure 1

12 pages, 1510 KB

Open AccessArticle

Modeling Temporal Lobe Epilepsy during Music Large-Scale Form Perception Using the Impulse Pattern Formulation (IPF) Brain Model

by Rolf Bader

Electronics 2024, 13(2), 362; https://doi.org/10.3390/electronics13020362 - 15 Jan 2024

Cited by 2 | Viewed by 1558

Abstract

Musical large-scale form is investigated using an electronic dance music piece fed into a Finite-Difference Time-Domain physical model of the cochlea, which again is input into an Impulse Pattern Formulation (IPF) Brain model. In previous studies, experimental EEG data showed an enhanced correlation between brain synchronization and the musical piece’s amplitude and fractal correlation dimension, representing musical tension and expectancy time points within the large-scale form of musical pieces. This is also in good agreement with a FitzHugh–Nagumo oscillator model.However, this model cannot display temporal developments in large-scale forms. The IPF Brain model shows a high correlation between cochlea input and brain synchronization at the gamma band range around 50 Hz, and also a strong negative correlation with low frequencies, associated with musical rhythm, during time frames with low cochlea input amplitudes. Such a high synchronization corresponds to temporal lobe epilepsy, often associated with creativity or spirituality. Therefore, the IPF Brain model results suggest that these conscious states occur at times of low external input at low frequencies, where isochronous musical rhythms are present. Full article

(This article belongs to the Special Issue Recent Advances in Audio, Speech and Music Processing and Analysis)

► Show Figures

Figure 1

18 pages, 4347 KB

Open AccessArticle

Applying the Lombard Effect to Speech-in-Noise Communication

by Gražina Korvel, Krzysztof Kąkol, Povilas Treigys and Bożena Kostek

Electronics 2023, 12(24), 4933; https://doi.org/10.3390/electronics12244933 - 8 Dec 2023

Viewed by 3165

Abstract

This study explored how the Lombard effect, a natural or artificial increase in speech loudness in noisy environments, can improve speech-in-noise communication. This study consisted of several experiments that measured the impact of different types of noise on synthesizing the Lombard effect. The main steps were as follows: first, a dataset of speech samples with and without the Lombard effect was collected in a controlled setting; then, the frequency changes in the speech signals were detected using the McAulay and Quartieri algorithm based on a 2D speech representation; next, an average formant track error was computed as a metric to evaluate the quality of the speech signals in noise. Three image assessment methods, namely the SSIM (Structural SIMilarity) index, RMSE (Root Mean Square Error), and dHash (Difference Hash) were used for this purpose. Furthermore, this study analyzed various spectral features of the speech signals in relation to the Lombard effect and the noise types. Finally, this study proposed a method for automatic noise profiling and applied pitch modifications to neutral speech signals according to the profile and the frequency change patterns. This study used an overlap-add synthesis in the STRAIGHT vocoder to generate the synthesized speech. Full article

(This article belongs to the Special Issue Recent Advances in Audio, Speech and Music Processing and Analysis)

► Show Figures

Figure 1

13 pages, 1515 KB

Open AccessArticle

Blind Source Separation with Strength Pareto Evolutionary Algorithm 2 (SPEA2) Using Discrete Wavelet Transform

by Husamettin Celik and Nurhan Karaboga

Electronics 2023, 12(21), 4383; https://doi.org/10.3390/electronics12214383 - 24 Oct 2023

Cited by 4 | Viewed by 1808

Abstract

This paper presents a new method for separating the mixed audio signals of simultaneous speakers using Blind Source Separation (BSS). The separation of mixed signals is an important issue today. In order to obtain more efficient and superior source estimation performance, a new algorithm that solves the BSS problem with Multi-Objective Optimization (MOO) methods was developed in this study. In this direction, we tested the application of two methods. Firstly, the Discrete Wavelet Transform (DWT) was used to eliminate the limited aspects of the traditional methods used in BSS and the small coefficients in the signals. Afterwards, the BSS process was optimized with the multi-purpose Strength Pareto Evolutionary Algorithm 2 (SPEA2). Secondly, the Minkowski distance method was proposed for distance measurement by using density information in the discrimination of individuals with raw fitness values for the concept of Pareto dominance. With this proposed method, the originals (original source signals) were estimated by separating the randomly mixed male and two female speech signals. Simulation and experimental results proved that the efficiency and performance of the proposed method can effectively solve BSS problems. In addition, the Pareto front approximation performance of this method also confirmed that it is superior in the Inverted Generational Distance (IGD) indicator. Full article

(This article belongs to the Special Issue Recent Advances in Audio, Speech and Music Processing and Analysis)

► Show Figures

Figure 1

Review

Jump to: Research

14 pages, 2488 KB

Open AccessReview

Review of Advances in Speech Processing with Focus on Artificial Neural Networks

by Douglas O’Shaughnessy

Electronics 2023, 12(13), 2887; https://doi.org/10.3390/electronics12132887 - 30 Jun 2023

Viewed by 2722

Abstract

Speech is the primary way via which most humans communicate. Computers facilitate this transfer of information, especially when people interact with databases. While some methods to manipulate and interpret speech date back many decades (e.g., Fourier analysis), other processing techniques were developed late last century (e.g., linear predictive coding and hidden Markov models). Nonetheless, the last 25 years have seen major advances leading to the wide acceptance of computer-based speech processing, e.g., cellular telephones and real-time online conversations. This paper reviews older techniques and recent methods that focus largely on artificial neural networks. The major highlights in speech research are examined, without delving into mathematical detail, while giving insight into the research choices that have been made. The focus of this work is to understand how and why the discussed methods function well. Full article

(This article belongs to the Special Issue Recent Advances in Audio, Speech and Music Processing and Analysis)

► Show Figures

Journal Menu

Journal Browser

Recent Advances in Audio, Speech and Music Processing and Analysis

Share This Special Issue

Special Issue Editors

Special Issue Information

Keywords

Benefits of Publishing in a Special Issue

Related Special Issue

Published Papers (10 papers)

Research

Review

Further Information

Guidelines

MDPI Initiatives

Follow MDPI