Recent Advances in Audio, Speech and Music Processing and Analysis

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Circuit and Signal Processing".

Deadline for manuscript submissions: 15 January 2025 | Viewed by 9150

Special Issue Editors


Dr. Chrisoula Alexandraki
Guest Editor
Department of Music Technology & Acoustics, Hellenic Mediterranean University, 74133 Rethymnon, Greece
Interests: networked music performance; machine musicianship; music information retrieval; musical acoustics

Special Issue Information

Dear Colleagues,

Audio plays an important role in everyday life, as it is incorporated in applications ranging from broadcasting and telecommunications to the entertainment, multimedia, and gaming industries. Although less prominent than image processing technology, which has dominated the industry in recent years, audio processing remains an area of vigorous research and technological development in academia. Relevant research initiatives address speech recognition, audio compression, noise cancellation, speaker verification and identification, voice synthesis, and voice transcription systems, to name a few. With respect to music signals, research focuses on music information retrieval for music streaming and recommendation; networked music making, teaching, and performing; autonomous and semi-autonomous computer musicians; and many more. This Special Issue provides an opportunity to disseminate state-of-the-art progress on emerging applications, algorithms, and systems related to audio, speech, and music processing and analysis.

Topics of interest include, but are not limited to:

  • Audio and speech analysis and recognition.
  • Deep learning for robust speech recognition systems.
  • Active noise cancelling systems.
  • Blind speech separation.
  • Robust speech recognition in multi-simultaneous speaker environments.
  • Room acoustics modeling.
  • Environmental sound recognition.
  • Music information retrieval.
  • Networked music performance systems.
  • Internet of Sounds technologies and applications.
  • Computer accompaniment and machine musicianship.
  • Digital music representations and collaborative music making.
  • Online music education technologies.
  • Computational approaches to musical acoustics.
  • Music generation using deep learning.

Dr. Athanasios Koutras
Dr. Chrisoula Alexandraki
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • sound analysis
  • sound processing
  • music information retrieval
  • audio analysis
  • audio recognition
  • music technology
  • computational music cognition

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (9 papers)


Research


24 pages, 12589 KiB  
Article
Two-Dimensional Direction-of-Arrival Estimation Using Direct Data Processing Approach in Directional Frequency Analysis and Recording (DIFAR) Sonobuoy
by Amirhossein Nemati, Bijan Zakeri and Amir Masoud Molaei
Electronics 2024, 13(15), 2931; https://doi.org/10.3390/electronics13152931 - 25 Jul 2024
Viewed by 502
Abstract
Today, the common solutions for underwater source angle detection require manned vessels and towed arrays, which are associated with high costs, risks, and deployment difficulties. An alternative solution for such applications is represented by acoustic vector sensors (AVSs), which are compact, lightweight, and moderate in cost, and which offer promising bearing discrimination in two or three dimensions. One of the most popular devices for passive monitoring in underwater surveillance systems that employ AVSs is the directional frequency analysis and recording (DIFAR) sonobuoy. In this paper, direct data-processing (DDP) algorithms are implemented to calculate the azimuth angle of underwater acoustic sources using the short-time Fourier transform (STFT) and the arctan method instead of the fast Fourier transform (FFT). These bearing-estimation algorithms use the ‘azigram’ to plot the estimated bearing of a source; it is demonstrated that this matrix can be obtained by computing the active sound intensity of the sound field and applying the inverse tangent to its real part. Reporting the time and frequency of a source simultaneously is one of the main advantages of this method, enabling the detection of multiple sources concurrently. DDP can also provide further details about source characteristics, such as a source's frequency content and the time interval of its presence.
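
For readers unfamiliar with azigrams, the following is a minimal sketch of the general arctan-of-active-intensity idea described in the abstract, not the authors' implementation. It assumes three time-aligned DIFAR-style channels (an omnidirectional pressure channel `p` and two orthogonal directional channels `vx`, `vy`); the function name, STFT parameters, and compass convention are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft

def azigram(p, vx, vy, fs, nperseg=1024):
    """Time-frequency azimuth map from DIFAR-style channels (illustrative sketch).

    p      : omnidirectional pressure channel
    vx, vy : orthogonal directional (cosine/sine) channels, time-aligned with p
    Returns frequency bins, frame times, and an azimuth estimate per STFT cell.
    """
    f, t, P = stft(p, fs, nperseg=nperseg)
    _, _, Vx = stft(vx, fs, nperseg=nperseg)
    _, _, Vy = stft(vy, fs, nperseg=nperseg)
    # Active-intensity components: real part of the pressure-velocity cross-spectra.
    Ix = np.real(np.conj(P) * Vx)
    Iy = np.real(np.conj(P) * Vy)
    # Four-quadrant arctan per time-frequency cell; the 0-degree reference and
    # rotation direction depend on the sensor's compass convention (assumed here).
    azimuth_deg = np.degrees(np.arctan2(Ix, Iy)) % 360.0
    return f, t, azimuth_deg
```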

19 pages, 9045 KiB  
Article
Audio Pre-Processing and Beamforming Implementation on Embedded Systems
by Jian-Hong Wang, Phuong Thi Le, Shih-Jung Kuo, Tzu-Chiang Tai, Kuo-Chen Li, Shih-Lun Chen, Ze-Yu Wang, Tuan Pham, Yung-Hui Li and Jia-Ching Wang
Electronics 2024, 13(14), 2784; https://doi.org/10.3390/electronics13142784 - 15 Jul 2024
Viewed by 676
Abstract
Since the invention of the microphone by Berliner in 1876, there have been numerous applications of audio processing, such as phonographs, broadcasting stations, and public address systems, which merely capture and amplify sound and play it back. Nowadays, audio processing also involves analysis and noise-filtering techniques. There are various methods for noise filtering, each employing unique algorithms, but they all require two or more microphones for signal processing and analysis. For instance, mobile phones use two microphones located in different positions for active noise cancellation (one for primary audio capture and the other for capturing ambient noise). A drawback, however, is that a distant sound source may lead to poor audio capture. To capture sound from distant sources, alternative methods, such as blind signal separation and beamforming, are necessary. This paper proposes employing a beamforming algorithm with two microphones to enhance speech and implementing this algorithm on an embedded system. Before beamforming, however, the direction of the sound source must be detected accurately so that the audio from that direction can be processed and analyzed.
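
As a rough illustration of the two-microphone pipeline the abstract outlines (direction estimation first, then beamforming), here is a hedged sketch that estimates the inter-microphone delay with GCC-PHAT, converts it to an angle, and applies simple delay-and-sum alignment. The function names, the PHAT weighting choice, the sign convention, and the integer-sample alignment are assumptions of this sketch, not the paper's algorithm.

```python
import numpy as np

def gcc_phat_delay(x, y, fs, max_tau=None):
    """Delay of y relative to x (seconds), estimated with GCC-PHAT."""
    n = int(2 ** np.ceil(np.log2(len(x) + len(y))))   # zero-pad for linear correlation
    X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
    R = np.conj(X) * Y
    R /= np.abs(R) + 1e-12                            # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n)
    max_shift = n // 2 if max_tau is None else min(int(max_tau * fs), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs   # positive: y lags x

def doa_from_delay(tau, mic_distance, c=343.0):
    """Angle of arrival relative to the broadside of the microphone pair (radians)."""
    return np.arcsin(np.clip(c * tau / mic_distance, -1.0, 1.0))

def delay_and_sum(x, y, tau, fs):
    """Advance the lagging microphone by the estimated delay and average the pair."""
    shift = int(round(tau * fs))
    return 0.5 * (x + np.roll(y, -shift))             # note: np.roll wraps at the edges
```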

20 pages, 1281 KiB  
Article
A Feature-Reduction Scheme Based on a Two-Sample t-Test to Eliminate Useless Spectrogram Frequency Bands in Acoustic Event Detection Systems
by Vahid Hajihashemi, Abdorreza Alavi Gharahbagh, Narges Hajaboutalebi, Mohsen Zahraei, José J. M. Machado and João Manuel R. S. Tavares
Electronics 2024, 13(11), 2064; https://doi.org/10.3390/electronics13112064 - 25 May 2024
Cited by 1 | Viewed by 821
Abstract
Acoustic event detection (AED) systems, combined with video surveillance systems, can enhance urban security and safety by automatically detecting incidents, supporting the smart city concept. AED systems mostly use mel spectrograms as a well-known, effective acoustic feature. The spectrogram is a combination of frequency bands. A key challenge is that some spectrogram bands may be similar across different events and therefore useless for AED. Removing useless bands reduces the input feature dimension and is highly desirable. This article proposes a mathematical feature-analysis method to identify and eliminate ineffective spectrogram bands and improve the efficiency of AED systems. The proposed approach uses a Student’s t-test to compare the frequency bands of the spectrogram across different acoustic events. The similarity of each frequency band among events is calculated using a two-sample t-test, allowing distinct and similar frequency bands to be identified. Removing the similar (uninformative) bands accelerates the training of the classifier by reducing the number of features and also enhances the system’s accuracy and efficiency. Based on the obtained results, the proposed method reduces the spectrogram bands by 26.3%. The results showed an average difference of 7.77% in the Jaccard distance, 4.07% in the Dice distance, and 5.7% in the Hamming distance between the bands selected on the training and test datasets. These small values underscore the validity of the obtained results for the test dataset.
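
The band-selection idea can be sketched in a few lines of SciPy. The following is a simplified stand-in for the paper's procedure, assuming per-clip mel spectrograms and a mean-energy summary per band; the function name and the 0.05 significance threshold are assumptions of this sketch.

```python
import numpy as np
from itertools import combinations
from scipy.stats import ttest_ind

def select_discriminative_bands(mel_specs, labels, alpha=0.05):
    """Keep mel bands that differ between at least one pair of event classes.

    mel_specs : array of shape (n_clips, n_bands, n_frames)
    labels    : event-class label per clip
    Returns a boolean mask over the band axis (True = keep).
    """
    X = mel_specs.mean(axis=2)                  # per-clip mean energy of each band
    labels = np.asarray(labels)
    keep = np.zeros(X.shape[1], dtype=bool)
    for a, b in combinations(np.unique(labels), 2):
        # Two-sample t-test per band between the two event classes.
        _, p = ttest_ind(X[labels == a], X[labels == b], axis=0, equal_var=False)
        keep |= p < alpha                       # a band is kept if any class pair differs
    return keep
```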

12 pages, 2278 KiB  
Article
Gated Cross-Attention for Universal Speaker Extraction: Toward Real-World Applications
by Yiru Zhang, Bijing Liu, Yong Yang and Qun Yang
Electronics 2024, 13(11), 2046; https://doi.org/10.3390/electronics13112046 - 24 May 2024
Viewed by 576
Abstract
Current target-speaker extraction (TSE) models have achieved good performance in separating target speech from highly overlapped multi-talker speech. However, in real-world applications, multi-talker speech is often sparsely overlapped, and the target speaker may be absent from the speech mixture, making it difficult for the model to extract the desired speech in such situations. To optimize models for various scenarios, universal speaker extraction has been proposed. However, current models do not distinguish between the presence or absence of the target speaker, resulting in suboptimal performance. In this paper, we propose a gated cross-attention network for universal speaker extraction. In our model, the cross-attention mechanism learns the correlation between the target speaker and the speech to determine whether the target speaker is present. Based on this correlation, the gate mechanism enables the model to focus on extracting speech when the target is present and to filter out features when the target is absent. Additionally, we propose a joint loss function to evaluate both the reconstructed target speech and silence. Experiments on the WSJ0-2mix-extr and LibriMix datasets show that our proposed method achieves superior performance over comparison approaches in terms of SI-SDR and WER.
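
To make the gating idea concrete, here is a minimal PyTorch sketch of a gated cross-attention block in the spirit described above; the layer sizes, the use of nn.MultiheadAttention, and the sigmoid gate placement are assumptions of this sketch rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Cross-attend mixture frames to a target-speaker embedding, then gate the result."""

    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, mix_feats, spk_emb):
        # mix_feats: (batch, frames, dim) features of the speech mixture
        # spk_emb:   (batch, 1, dim)      enrollment embedding of the target speaker
        attended, _ = self.attn(query=mix_feats, key=spk_emb, value=spk_emb)
        g = self.gate(attended)        # should approach 0 when the target speaker is absent
        return g * attended            # gated features passed on to the separator
```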

12 pages, 2954 KiB  
Article
Audio Recognition of the Percussion Sounds Generated by a 3D Auto-Drum Machine System via Machine Learning
by Spyros Brezas, Alexandros Skoulakis, Maximos Kaliakatsos-Papakostas, Antonis Sarantis-Karamesinis, Yannis Orphanos, Michael Tatarakis, Nektarios A. Papadogiannis, Makis Bakarezos, Evaggelos Kaselouris and Vasilis Dimitriou
Electronics 2024, 13(9), 1787; https://doi.org/10.3390/electronics13091787 - 6 May 2024
Viewed by 835
Abstract
A novel 3D auto-drum machine system for the generation and recording of percussion sounds is developed and presented. The capabilities of the machine, along with a calibration, sound production, and collection protocol, are demonstrated. The sounds are generated by a drumstick at pre-defined positions and with known impact forces applied by the programmable 3D auto-drum machine. The generated percussion sounds are accompanied by the spatial excitation coordinates and the corresponding impact forces, allowing large databases to be built, as required by machine learning models. The recordings of the radiated sound captured by a microphone are analyzed using a pre-trained deep learning model, evaluating the consistency of the physical sample generation method. The results demonstrate the ability to perform regression and classification tasks when fine-tuning the deep learning model with the gathered data. The produced databases can properly train machine learning models, aiding the investigation of alternative, cost-effective materials and geometries with relevant sound characteristics and the development of accurate vibroacoustic numerical models for studying the sound synthesis of percussion instruments.
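
As a simplified stand-in for the fine-tuned deep model described above, the sketch below shows how such a database (audio clips paired with the machine's logged impact forces and strike positions) could feed classical regression and classification baselines. The librosa/scikit-learn stack, the MFCC summary, and the random-forest models are assumptions of this illustration, not the paper's pipeline.

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def clip_features(path, sr=44100, n_mfcc=20):
    """Summarize one percussion recording as the mean and std of its MFCCs."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([m.mean(axis=1), m.std(axis=1)])

def fit_baselines(wav_paths, impact_forces, strike_positions):
    """Regression on impact force and classification on strike position."""
    X = np.stack([clip_features(p) for p in wav_paths])
    force_model = RandomForestRegressor().fit(X, impact_forces)         # regression task
    position_model = RandomForestClassifier().fit(X, strike_positions)  # classification task
    return force_model, position_model
```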

12 pages, 1510 KiB  
Article
Modeling Temporal Lobe Epilepsy during Music Large-Scale Form Perception Using the Impulse Pattern Formulation (IPF) Brain Model
by Rolf Bader
Electronics 2024, 13(2), 362; https://doi.org/10.3390/electronics13020362 - 15 Jan 2024
Cited by 1 | Viewed by 744
Abstract
Musical large-scale form is investigated using an electronic dance music piece fed into a Finite-Difference Time-Domain physical model of the cochlea, whose output is in turn fed into an Impulse Pattern Formulation (IPF) Brain model. In previous studies, experimental EEG data showed an enhanced correlation between brain synchronization and the musical piece’s amplitude and fractal correlation dimension, representing musical tension and expectancy time points within the large-scale form of musical pieces. This is also in good agreement with a FitzHugh–Nagumo oscillator model. However, that model cannot display temporal developments in large-scale forms. The IPF Brain model shows a high correlation between cochlea input and brain synchronization in the gamma-band range around 50 Hz, and also a strong negative correlation with low frequencies, associated with musical rhythm, during time frames with low cochlea input amplitudes. Such high synchronization corresponds to temporal lobe epilepsy, often associated with creativity or spirituality. Therefore, the IPF Brain model results suggest that these conscious states occur at times of low external input at low frequencies, where isochronous musical rhythms are present.
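
For context on the oscillator model the abstract compares against, here is a minimal Euler-scheme sketch of a single FitzHugh–Nagumo unit driven by an external input; the parameter values and step size are illustrative, and this is not the IPF Brain model itself.

```python
import numpy as np

def fitzhugh_nagumo(I, dt=1e-3, a=0.7, b=0.8, eps=0.08):
    """Integrate one FitzHugh-Nagumo unit driven by an input signal I (Euler scheme).

    dv/dt = v - v^3/3 - w + I,   dw/dt = eps * (v + a - b * w)
    """
    v = np.zeros_like(I, dtype=float)
    w = np.zeros_like(I, dtype=float)
    for n in range(1, len(I)):
        v[n] = v[n-1] + dt * (v[n-1] - v[n-1] ** 3 / 3.0 - w[n-1] + I[n-1])
        w[n] = w[n-1] + dt * eps * (v[n-1] + a - b * w[n-1])
    return v, w
```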

18 pages, 4347 KiB  
Article
Applying the Lombard Effect to Speech-in-Noise Communication
by Gražina Korvel, Krzysztof Kąkol, Povilas Treigys and Bożena Kostek
Electronics 2023, 12(24), 4933; https://doi.org/10.3390/electronics12244933 - 8 Dec 2023
Viewed by 1158
Abstract
This study explored how the Lombard effect, a natural or artificial increase in speech loudness in noisy environments, can improve speech-in-noise communication. The study consisted of several experiments that measured the impact of different types of noise on synthesizing the Lombard effect. The main steps were as follows: first, a dataset of speech samples with and without the Lombard effect was collected in a controlled setting; then, the frequency changes in the speech signals were detected using the McAulay and Quatieri algorithm based on a 2D speech representation; next, an average formant-track error was computed as a metric to evaluate the quality of the speech signals in noise. Three image assessment methods, namely the SSIM (Structural SIMilarity) index, RMSE (Root Mean Square Error), and dHash (Difference Hash), were used for this purpose. Furthermore, this study analyzed various spectral features of the speech signals in relation to the Lombard effect and the noise types. Finally, this study proposed a method for automatic noise profiling and applied pitch modifications to neutral speech signals according to the profile and the frequency change patterns. Overlap-add synthesis in the STRAIGHT vocoder was used to generate the synthesized speech.
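
Two of the image-style quality measures mentioned above (SSIM and RMSE) can be sketched on log-magnitude spectrograms as follows. The normalization, the crude length cropping, and the omission of dHash are simplifications of this illustration, not the paper's evaluation protocol.

```python
import numpy as np
from scipy.signal import stft
from skimage.metrics import structural_similarity

def spectrogram_image(x, fs, nperseg=512):
    """Log-magnitude spectrogram scaled to [0, 1] so it can be compared as an image."""
    _, _, S = stft(x, fs, nperseg=nperseg)
    img = 20.0 * np.log10(np.abs(S) + 1e-10)
    return (img - img.min()) / (img.max() - img.min() + 1e-12)

def compare_renditions(neutral, lombard, fs):
    """SSIM and RMSE between neutral and Lombard renditions of the same utterance."""
    a, b = spectrogram_image(neutral, fs), spectrogram_image(lombard, fs)
    n_frames = min(a.shape[1], b.shape[1])     # crude alignment: crop to the shorter signal
    a, b = a[:, :n_frames], b[:, :n_frames]
    ssim = structural_similarity(a, b, data_range=1.0)
    rmse = np.sqrt(np.mean((a - b) ** 2))
    return ssim, rmse
```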

13 pages, 1515 KiB  
Article
Blind Source Separation with Strength Pareto Evolutionary Algorithm 2 (SPEA2) Using Discrete Wavelet Transform
by Husamettin Celik and Nurhan Karaboga
Electronics 2023, 12(21), 4383; https://doi.org/10.3390/electronics12214383 - 24 Oct 2023
Cited by 1 | Viewed by 1089
Abstract
This paper presents a new method for separating the mixed audio signals of simultaneous speakers using Blind Source Separation (BSS). The separation of mixed signals is an important issue today. In order to obtain more efficient and superior source-estimation performance, a new algorithm that solves the BSS problem with Multi-Objective Optimization (MOO) methods was developed in this study. To this end, the application of two methods was tested. Firstly, the Discrete Wavelet Transform (DWT) was used to overcome the limitations of the traditional methods used in BSS and to eliminate the small coefficients in the signals. Afterwards, the BSS process was optimized with the multi-objective Strength Pareto Evolutionary Algorithm 2 (SPEA2). Secondly, the Minkowski distance was proposed for the distance measurement in the density information used to discriminate between individuals with equal raw fitness values under the concept of Pareto dominance. With this proposed method, the original source signals were estimated by separating randomly mixed speech signals of one male and two female speakers. Simulation and experimental results proved that the proposed method can efficiently and effectively solve BSS problems. In addition, its Pareto-front approximation performance confirmed that it is superior in terms of the Inverted Generational Distance (IGD) indicator.
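
The abstract's second contribution, swapping the distance used in SPEA2's density estimate for a Minkowski distance, can be illustrated in a few lines. The order p, the default k, and the vectorized pairwise computation are assumptions of this sketch rather than the authors' code.

```python
import numpy as np

def spea2_density(objs, p=3, k=None):
    """SPEA2-style density term computed with a Minkowski distance of order p.

    objs : (n_individuals, n_objectives) objective values of population plus archive.
    SPEA2 defines D(i) = 1 / (sigma_i_k + 2), where sigma_i_k is the distance from
    individual i to its k-th nearest neighbour (k defaults to sqrt(n_individuals)).
    """
    objs = np.asarray(objs, dtype=float)
    n = len(objs)
    k = int(np.sqrt(n)) if k is None else k
    k = min(k, n - 1)                           # guard against tiny populations
    # Pairwise Minkowski distances between individuals in objective space.
    diff = np.abs(objs[:, None, :] - objs[None, :, :])
    dist = (diff ** p).sum(axis=2) ** (1.0 / p)
    sigma_k = np.sort(dist, axis=1)[:, k]       # column 0 is the zero self-distance
    return 1.0 / (sigma_k + 2.0)
```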

Review


14 pages, 2488 KiB  
Review
Review of Advances in Speech Processing with Focus on Artificial Neural Networks
by Douglas O’Shaughnessy
Electronics 2023, 12(13), 2887; https://doi.org/10.3390/electronics12132887 - 30 Jun 2023
Viewed by 1282
Abstract
Speech is the primary means by which most humans communicate. Computers facilitate this transfer of information, especially when people interact with databases. While some methods to manipulate and interpret speech date back many decades (e.g., Fourier analysis), other processing techniques were developed late last century (e.g., linear predictive coding and hidden Markov models). Nonetheless, the last 25 years have seen major advances leading to the wide acceptance of computer-based speech processing, e.g., in cellular telephones and real-time online conversations. This paper reviews both older techniques and recent methods, focusing largely on artificial neural networks. The major highlights in speech research are examined, without delving into mathematical detail, while giving insight into the research choices that have been made. The focus of this work is to understand how and why the discussed methods function well.
