
Machine Learning in Audio Signal Processing and Music Information Retrieval

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (20 June 2024) | Viewed by 9647

Special Issue Editors


Guest Editor
Application of Information and Communication Technologies (ATIC) Research Group, ETSI Telecomunicación, Campus Universitario de Teatinos s/n, 29071 Malaga, Spain
Interests: serious games; digital audio and image processing; pattern analysis and recognition; applications of signal processing techniques and methods

Guest Editor
Application of Information and Communication Technologies (ATIC) Research Group, ETSI Telecomunicación, Campus Universitario de Teatinos s/n, 29071 Malaga, Spain
Interests: music information retrieval; audio signal processing; machine learning; musical acoustics; serious games; EEG signal processing; multimedia applications

Special Issue Information

Dear Colleagues,

Machine learning methods and applications have been in use for some time. Recently, the growth of available multimedia content and databases, together with computational advances, artificial intelligence techniques and, especially, deep learning methods, has extended their reach across all application areas of multimedia signal processing and, within this context, across audio and music signal research topics, including music information retrieval.

These methods cover a wide range of techniques, from classical machine learning to recently developed deep neural networks, with applications in a large variety of tasks including audio classification, source separation, enhancement, transcription, indexation, content creation, entertainment, gaming, etc.

In this context, there is still ample room for research and innovation. This Special Issue aims to provide the research community with a space to share its recent findings and advances. The topics of interest include, but are not limited to, the following:

  • Machine learning methods for music/audio information retrieval, indexation, and querying;
  • Music instrument identification, synthesis, transformation, and classification;
  • Symbolic music processing;
  • Machine learning for the discovery of musical structure, segmentation, and form: melody and motives, harmony, chords and tonality, rhythm, beat, tempo, timbre, instrumentation and voice, style, and genre;
  • Musical content creation: melodies, accompaniment, orchestration, etc.;
  • Machine learning methods for natural language processing, text, and web mining;
  • Sound source separation;
  • Music transcription and annotation, alignment, synchronization, and score following; optical music recognition;
  • Audio fingerprinting;
  • Machine learning approaches for visualization, auralization, and sonification;
  • Music recommendation and playlist generation;
  • Music and health, wellbeing, therapy, music training, and education;
  • Machine learning methods for music and audio in gaming.

Prof. Dr. Lorenzo J. Tardón
Prof. Dr. Isabel Barbancho
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • music information retrieval
  • machine learning for audio and music
  • intelligent audio signal processing
  • audio analysis and transformation

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (7 papers)


Research

26 pages, 12966 KiB  
Article
Optical Medieval Music Recognition—A Complete Pipeline for Historic Chants
by Alexander Hartelt, Tim Eipert and Frank Puppe
Appl. Sci. 2024, 14(16), 7355; https://doi.org/10.3390/app14167355 - 20 Aug 2024
Viewed by 588
Abstract
Manual transcription of music is tedious work that can be greatly facilitated by optical music recognition (OMR) software. However, OMR software is error prone, in particular for older handwritten documents. This paper introduces and evaluates a pipeline that automates the entire OMR workflow in the context of the Corpus Monodicum project, enabling the transcription of historical chants. In addition to typical OMR tasks such as staff line detection, layout detection, and symbol recognition, the rarely addressed tasks of text and syllable recognition and of assigning syllables to symbols are tackled. For quantitative and qualitative evaluation, we use documents written in square notation developed in the 11th–12th century, but the methods apply to many other notations as well. Quantitative evaluation measures the number of interventions necessary for correction, which is about 0.4% for layout recognition (including the division of text into chants), 2.4% for symbol recognition (including pitch and reading order), and 2.3% for syllable alignment with correct text and symbols. Qualitative evaluation showed an efficiency gain over manual transcription with an elaborate tool by a factor of about 9. In a second use case, with printed chants in similar notation from the “Graduale Synopticum”, the evaluation results for symbols are much better, except for syllable alignment, indicating the difficulty of this task.
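For readers who want a concrete picture of such a staged workflow, the following Python sketch shows one way to organize an OMR pipeline with per-stage correction-rate bookkeeping; the stage names, function signatures, and the StageResult/OMRPipeline classes are hypothetical illustrations, not the pipeline implemented in the paper.

```python
# Hypothetical sketch of a staged OMR pipeline with per-stage error bookkeeping.
# All stage names and signatures are illustrative placeholders.
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class StageResult:
    name: str
    output: Any
    corrections_needed: int   # manual interventions required after this stage
    total_items: int

    @property
    def correction_rate(self) -> float:
        return self.corrections_needed / max(self.total_items, 1)

@dataclass
class OMRPipeline:
    stages: list = field(default_factory=list)   # list of (name, callable) pairs

    def add_stage(self, name: str, fn: Callable[[Any], tuple]) -> None:
        self.stages.append((name, fn))

    def run(self, page_image: Any) -> list:
        results, data = [], page_image
        for name, fn in self.stages:
            data, corrections, total = fn(data)   # each stage returns (output, corrections, total)
            results.append(StageResult(name, data, corrections, total))
        return results

# Usage sketch: plug in detectors for staff lines, layout, symbols, and syllables.
# pipeline = OMRPipeline()
# pipeline.add_stage("staff_lines", detect_staff_lines)
# pipeline.add_stage("layout", detect_layout)
# pipeline.add_stage("symbols", recognize_symbols)
# pipeline.add_stage("syllables", align_syllables)
# for r in pipeline.run(load_image("chant_page.png")):
#     print(f"{r.name}: {r.correction_rate:.1%} manual corrections")
```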

16 pages, 1818 KiB  
Article
FFA-BiGRU: Attention-Based Spatial-Temporal Feature Extraction Model for Music Emotion Classification
by Yuping Su, Jie Chen, Ruiting Chai, Xiaojun Wu and Yumei Zhang
Appl. Sci. 2024, 14(16), 6866; https://doi.org/10.3390/app14166866 - 6 Aug 2024
Viewed by 774
Abstract
Music emotion recognition is becoming an important research direction due to its great significance for music information retrieval, music recommendation, and other applications. In the task of music emotion recognition, the key to accurate recognition lies in fully extracting the affect-salient features. In this paper, we propose an end-to-end spatial-temporal feature extraction method named FFA-BiGRU for music emotion classification. Taking the log Mel-spectrogram of music audio as input, the method employs an attention-based convolutional residual module named FFA, which serves as a spatial feature learning module to obtain multi-scale spatial features. In the FFA module, three group-architecture blocks extract multi-level spatial features, each consisting of a stack of channel-spatial attention-based residual blocks. The output features from FFA are then fed into a bidirectional gated recurrent unit (BiGRU) module to further capture the temporal features of the music. To make full use of the extracted spatial and temporal features, the output feature maps of the FFA module and of the BiGRU are concatenated along the channel dimension. Finally, the concatenated features are passed through fully connected layers to predict the emotion class. Experimental results on the EMOPIA dataset show that the proposed model achieves better classification accuracy than existing baselines, and ablation experiments demonstrate the effectiveness of each part of the proposed method.
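As a rough illustration of the architecture outlined above, the following PyTorch sketch combines a convolutional spatial branch with channel attention, a bidirectional GRU temporal branch, concatenation of the two branches, and a fully connected classifier; the layer sizes, the simplified attention block, and the four-class output are assumptions for illustration and do not reproduce the FFA-BiGRU model itself.

```python
# Minimal PyTorch sketch of the FFA-BiGRU idea: spatial CNN branch with channel
# attention + BiGRU temporal branch + concatenation + fully connected classifier.
import torch
import torch.nn as nn

class ChannelAttentionResBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.att = nn.Sequential(  # squeeze-and-excitation style channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.conv(x)
        return x + y * self.att(y)          # attention-weighted residual connection

class FFABiGRU(nn.Module):
    def __init__(self, n_mels: int = 128, n_classes: int = 4, channels: int = 32):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, 3, padding=1)
        self.ffa = nn.Sequential(*[ChannelAttentionResBlock(channels) for _ in range(3)])
        self.gru = nn.GRU(channels * n_mels, 128, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(channels * n_mels + 256, n_classes)

    def forward(self, mel):                              # mel: (batch, 1, n_mels, time)
        spatial = self.ffa(self.stem(mel))               # (B, C, n_mels, T)
        b, c, f, t = spatial.shape
        seq = spatial.permute(0, 3, 1, 2).reshape(b, t, c * f)   # (B, T, C*F)
        temporal, _ = self.gru(seq)                      # (B, T, 256)
        fused = torch.cat([seq, temporal], dim=-1)       # concatenate the two branches
        return self.fc(fused.mean(dim=1))                # pool over time, then classify

# logits = FFABiGRU()(torch.randn(2, 1, 128, 431))       # e.g. ~10 s clips
```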

17 pages, 897 KiB  
Article
DExter: Learning and Controlling Performance Expression with Diffusion Models
by Huan Zhang, Shreyan Chowdhury, Carlos Eduardo Cancino-Chacón, Jinhua Liang, Simon Dixon and Gerhard Widmer
Appl. Sci. 2024, 14(15), 6543; https://doi.org/10.3390/app14156543 - 26 Jul 2024
Viewed by 910
Abstract
In the pursuit of developing expressive music performance models using artificial intelligence, this paper introduces DExter, a new approach that leverages diffusion probabilistic models to render Western classical piano performances. The main challenge in performance rendering is the continuous and sequential modeling of expressive timing and dynamics over time, which is critical for capturing the evolving nuances that characterize live musical performances. In this approach, performance parameters are represented in a continuous expression space, and a diffusion model is trained to predict these continuous parameters while conditioned on the musical score. Furthermore, DExter also enables the generation of interpretations (expressive variations of a performance) guided by perceptually meaningful features, by being jointly conditioned on score and perceptual-feature representations. We find that the model is useful for learning expressive performance, generating perceptually steered performances, and transferring performance styles. We assess the model through quantitative and qualitative analyses, focusing on performance metrics along dimensions such as asynchrony and articulation, as well as through listening tests that compare generated performances with different human interpretations. The results show that DExter captures the time-varying correlation of the expressive parameters and compares well to existing rendering models in subjectively evaluated ratings. The perceptual-feature-conditioned generation and style transfer capabilities of DExter are verified via a proxy model predicting perceptual characteristics of differently steered performances.
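The core training idea, predicting the noise added to continuous expression parameters while conditioning on score features, can be sketched generically in PyTorch as below; the small MLP denoiser, feature dimensions, and linear noise schedule are assumptions chosen for brevity, not the DExter architecture.

```python
# Generic sketch of score-conditioned diffusion training on continuous
# expression parameters (e.g. timing and dynamics). Not the DExter model itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondDenoiser(nn.Module):
    """Predicts the noise added to an expression sequence, given score features."""
    def __init__(self, expr_dim=2, score_dim=16, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(expr_dim + score_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, expr_dim),
        )

    def forward(self, noisy_expr, score_feats, t):
        # noisy_expr: (B, L, expr_dim), score_feats: (B, L, score_dim), t: (B,)
        t_emb = t.float().view(-1, 1, 1).expand(-1, noisy_expr.size(1), 1)
        return self.net(torch.cat([noisy_expr, score_feats, t_emb], dim=-1))

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(model, expr, score_feats):
    t = torch.randint(0, T, (expr.size(0),))
    noise = torch.randn_like(expr)
    a = alphas_bar[t].view(-1, 1, 1)
    noisy = a.sqrt() * expr + (1 - a).sqrt() * noise   # forward diffusion step
    return F.mse_loss(model(noisy, score_feats, t / T), noise)

# loss = diffusion_loss(CondDenoiser(), torch.randn(4, 100, 2), torch.randn(4, 100, 16))
```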

13 pages, 385 KiB  
Article
Attributes Relevance in Content-Based Music Recommendation System
by Daniel Kostrzewa, Jonatan Chrobak and Robert Brzeski
Appl. Sci. 2024, 14(2), 855; https://doi.org/10.3390/app14020855 - 19 Jan 2024
Cited by 4 | Viewed by 1702
Abstract
The recommendation of musical songs is increasingly in demand because of the millions of users and songs included in online databases; effective methods that solve this task automatically therefore need to be created. In this paper, the task is addressed using three basic factors: genre classification produced by a neural network, Mel-frequency cepstral coefficients (MFCCs), and the tempo of the song. The recommendation system is built on a probability function based on these three factors. The authors' contribution to the development of automatic content-based recommendation is a set of methods built from these three factors; using different combinations of them, four strategies were created. All four strategies were evaluated based on the feedback of 37 users, who completed a total of 300 surveys. The proposed recommendation methods show a clear improvement over a random baseline, and the obtained results indicate that the MFCC parameters have the greatest impact on the quality of the recommendations.
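A hedged sketch of how the three factors named above (genre prediction, MFCC distance, and tempo difference) could be combined into a single recommendation score is given below; the weights and the mapping of distances to similarities are illustrative assumptions, not the authors' probability function.

```python
# Illustrative combination of genre, MFCC, and tempo factors into one score.
import numpy as np

def recommendation_score(query, candidate, w_genre=0.4, w_mfcc=0.4, w_tempo=0.2):
    """query/candidate: dicts with 'genre_probs' (K,), 'mfcc' (D,), 'tempo' (BPM)."""
    genre_sim = float(np.dot(query["genre_probs"], candidate["genre_probs"]))
    mfcc_dist = np.linalg.norm(query["mfcc"] - candidate["mfcc"])
    mfcc_sim = 1.0 / (1.0 + mfcc_dist)                       # map distance to (0, 1]
    tempo_sim = 1.0 / (1.0 + abs(query["tempo"] - candidate["tempo"]) / 10.0)
    return w_genre * genre_sim + w_mfcc * mfcc_sim + w_tempo * tempo_sim

def recommend(query, library, k=5):
    # Rank candidate songs by the combined score and return the top k.
    ranked = sorted(library, key=lambda c: recommendation_score(query, c), reverse=True)
    return ranked[:k]
```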

16 pages, 1082 KiB  
Article
Analyzing the Influence of Diverse Background Noises on Voice Transmission: A Deep Learning Approach to Noise Suppression
by Alberto Nogales, Javier Caracuel-Cayuela and Álvaro J. García-Tejedor
Appl. Sci. 2024, 14(2), 740; https://doi.org/10.3390/app14020740 - 15 Jan 2024
Cited by 1 | Viewed by 1465
Abstract
This paper presents an approach to enhancing the clarity and intelligibility of speech in digital communications compromised by various background noises. Utilizing deep learning techniques, specifically a variational autoencoder (VAE) with 2D convolutional filters, we aim to suppress background noise in audio signals. Our method focuses on four simulated environmental noise scenarios: storms, wind, traffic, and aircraft. The training dataset was obtained from public sources (the TED-LIUM 3 dataset, which includes audio recordings from the popular TED Talk series) combined with these background noises. The audio signals were transformed into 2D power spectrograms, on which our VAE model was trained to filter out the noise and reconstruct clean audio. Our results demonstrate that the model outperforms existing state-of-the-art solutions in noise suppression. Although differences between noise types were observed, it was challenging to conclude definitively which background noise most adversely affects speech quality. The results were assessed with objective (mathematical metrics) and subjective (human listening tests on a set of audio clips) methods. Notably, wind noise showed the smallest deviation between the noisy and cleaned audio and was subjectively perceived as the most improved scenario. Future work should involve refining the phase calculation of the cleaned audio and creating a more balanced dataset to minimize differences in audio quality across scenarios. Practical applications of the model to real-time streaming audio are also envisaged. This research contributes to the field of audio signal processing by offering a deep learning solution tailored to various noise conditions, enhancing digital communication quality.
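To make the setup concrete, the following PyTorch sketch shows a small 2D-convolutional variational autoencoder trained to map noisy power spectrograms to clean ones; the fixed 128×128 input size, layer widths, and KL weighting are assumptions for illustration, not the model evaluated in the paper.

```python
# Compact sketch of a 2D-convolutional VAE for spectrogram denoising.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectrogramVAE(nn.Module):
    def __init__(self, latent_dim=64):
        super().__init__()
        self.enc = nn.Sequential(                              # (B,1,128,128) -> (B,64,16,16)
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.fc_mu = nn.Linear(64 * 16 * 16, latent_dim)
        self.fc_logvar = nn.Linear(64 * 16 * 16, latent_dim)
        self.fc_dec = nn.Linear(latent_dim, 64 * 16 * 16)
        self.dec = nn.Sequential(                              # mirror of the encoder
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, noisy):
        h = self.enc(noisy).flatten(1)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        out = self.dec(self.fc_dec(z).view(-1, 64, 16, 16))
        return out, mu, logvar

def vae_loss(denoised, clean, mu, logvar, beta=1e-3):
    rec = F.mse_loss(denoised, clean)                          # reconstruct the clean target
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kld

# out, mu, logvar = SpectrogramVAE()(torch.randn(2, 1, 128, 128))
```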

15 pages, 5957 KiB  
Article
Deformer: Denoising Transformer for Improved Audio Music Genre Classification
by Jigang Wang, Shuyu Li and Yunsick Sung
Appl. Sci. 2023, 13(23), 12673; https://doi.org/10.3390/app132312673 - 25 Nov 2023
Cited by 2 | Viewed by 2035
Abstract
Audio music genre classification categorizes audio music into various genres. Traditional approaches based on convolutional recurrent neural networks do not consider long-range temporal information, and their sequential structures result in longer training times and convergence difficulties. To overcome these problems, a transformer-based approach was introduced. However, that approach employs pre-training based on momentum contrast (MoCo), a technique that increases computational costs owing to its reliance on extracting many negative samples and its use of highly sensitive hyperparameters; this complicates the training process and increases the risk of learning imbalances between positive and negative sample sets. In this paper, a method for audio music genre classification called Deformer is proposed. Deformer learns deep representations of audio music data through a denoising process, eliminating the need for MoCo and additional hyperparameters and thus reducing computational costs. In the denoising process, it employs a prior decoder to reconstruct the audio patches, thereby enhancing the interpretability of the representations. By calculating the mean squared error loss between the reconstructed and real patches, Deformer can learn a more refined representation of the audio data. The performance of the proposed method was experimentally compared with two baseline models: one based on S3T and one employing a residual neural network-bidirectional gated recurrent unit (ResNet-BiGRU). Deformer achieved 84.5% accuracy, surpassing both the ResNet-BiGRU-based (81%) and S3T-based (81.1%) models, highlighting its superior performance in audio classification.
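The denoising pre-training idea described above can be sketched compactly: corrupt spectrogram patches, encode them with a transformer, and reconstruct the original patches under an MSE loss. In the sketch below, the patch size, model width, and additive-noise corruption are illustrative assumptions rather than the Deformer implementation.

```python
# Sketch of denoising pre-training on audio patches with a transformer encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoisingPatchTransformer(nn.Module):
    def __init__(self, patch_dim=256, d_model=128, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.decoder = nn.Linear(d_model, patch_dim)       # decoder head reconstructs patches

    def forward(self, patches):                            # patches: (B, N, patch_dim)
        return self.decoder(self.encoder(self.embed(patches)))

def denoising_loss(model, patches, noise_std=0.1):
    corrupted = patches + noise_std * torch.randn_like(patches)  # additive corruption
    return F.mse_loss(model(corrupted), patches)                 # reconstruct clean patches

# model = DenoisingPatchTransformer()
# loss = denoising_loss(model, torch.randn(2, 64, 256))
```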

19 pages, 20258 KiB  
Article
Design of a Semantic Understanding System for Optical Staff Symbols
by Fengbin Lou, Yaling Lu and Guangyu Wang
Appl. Sci. 2023, 13(23), 12627; https://doi.org/10.3390/app132312627 - 23 Nov 2023
Viewed by 1103
Abstract
Symbolic semantic understanding of staff images is an important technological support for achieving “intelligent score flipping”. Due to the complex composition of staff symbols and the strong semantic correlation between symbol spaces, it is difficult to determine the pitch and duration of each note when the staff is performed. In this paper, we design a semantic understanding system for optical staff symbols. The system uses YOLOv5 to implement the low-level semantic understanding stage, which recognizes the pitch and duration of natural scales and the other symbols that affect pitch and duration. The proposed note encoding reconstruction algorithm implements the high-level semantic understanding stage: it resolves the logical, spatial, and temporal relationships between natural scales and other symbols based on music theory and outputs digital codes for the pitch and duration of the main notes during performance. The model is trained on a self-constructed SUSN dataset. Experimental results show that YOLOv5 achieves a precision of 0.989 and a recall of 0.972; the system's error rate is 0.031, and its omission rate is 0.021. The paper concludes by analyzing the causes of semantic understanding errors and offers recommendations for further research. The results provide a method for multimodal music artificial intelligence applications such as notation recognition through listening, intelligent score flipping, and automatic performance.
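The high-level stage, turning detected symbol boxes into pitch and duration codes, can be illustrated with a simple geometric rule: each half line-spacing above the bottom staff line is one diatonic step, and nearby accidentals modify the pitch. The class labels, staff geometry, and MIDI mapping in the sketch below are hypothetical and much simpler than the note encoding reconstruction algorithm in the paper.

```python
# Toy decoding of detector output (e.g. from YOLOv5) into (pitch, duration) codes.
# Labels, geometry, and the treble-clef assumption are illustrative only.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str            # e.g. "notehead_quarter", "sharp", "flat"
    x: float              # horizontal centre in pixels
    y: float              # vertical centre in pixels

def pitch_from_position(y: float, bottom_line_y: float, line_spacing: float,
                        bottom_line_midi: int = 64) -> int:
    """Map vertical position to a MIDI pitch: each half line-spacing is one diatonic step."""
    steps = round((bottom_line_y - y) / (line_spacing / 2))
    diatonic = [0, 1, 3, 5, 7, 8, 10]     # semitone offsets of E, F, G, A, B, C, D (treble clef)
    octave, degree = divmod(steps, 7)
    return bottom_line_midi + 12 * octave + diatonic[degree]

def decode_notes(detections, bottom_line_y, line_spacing):
    notes, ordered = [], sorted(detections, key=lambda d: d.x)   # reading order left to right
    for d in ordered:
        if d.label.startswith("notehead"):
            pitch = pitch_from_position(d.y, bottom_line_y, line_spacing)
            # accidentals just to the left of the note head modify its pitch
            for a in ordered:
                if a.label in ("sharp", "flat") and 0 < d.x - a.x < line_spacing * 2:
                    pitch += 1 if a.label == "sharp" else -1
            duration = d.label.split("_")[1]                      # e.g. "quarter"
            notes.append((pitch, duration))
    return notes
```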
