Advancing Audio/Speech Machine Learning: From Static to Continual Learning

A special issue of Acoustics (ISSN 2624-599X).

Deadline for manuscript submissions: 22 July 2026 | Viewed by 4299

Special Issue Editor


Dr. Kele Xu
Guest Editor
State Key Laboratory of Complex & Critical Software Environment, College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
Interests: audio signal processing; machine learning; intelligent software systems

Special Issue Information

Dear Colleagues,

Audio and speech signal processing has traditionally relied on static models developed for fixed datasets. However, real-world audio environments are constantly evolving, with new sounds and contexts emerging over time. In such dynamic settings, conventional models struggle to remain effective without frequent retraining. Continual learning offers a promising solution by enabling audio systems to adapt incrementally to new data while retaining previously acquired knowledge. This capability is particularly important in real-world applications such as healthcare, surveillance, and interactive media, where adaptability, efficiency, and robustness are essential.
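
As a concrete illustration of this adapt-while-retaining paradigm, the sketch below shows a minimal experience-replay training loop for an audio classifier in PyTorch. The classifier, buffer capacity, and batch shapes are hypothetical placeholders rather than a prescribed method.

```python
# Minimal experience-replay sketch for continual audio classification.
# The classifier, tensors, and capacity below are illustrative
# placeholders, not a prescribed method.
import random
import torch

class ReplayBuffer:
    """Reservoir-sampling buffer that retains examples from earlier tasks."""
    def __init__(self, capacity=1000):
        self.capacity, self.data, self.seen = capacity, [], 0

    def add(self, x, y):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append((x, y))
        else:
            # Reservoir sampling keeps a uniform sample of the whole stream.
            idx = random.randrange(self.seen)
            if idx < self.capacity:
                self.data[idx] = (x, y)

    def sample(self, k):
        batch = random.sample(self.data, min(k, len(self.data)))
        xs, ys = zip(*batch)
        return torch.stack(xs), torch.stack(ys)

def train_step(model, optimizer, loss_fn, x_new, y_new, buffer, replay_k=16):
    """One gradient step on new data mixed with replayed old data."""
    x, y = x_new, y_new
    if buffer.data:
        x_old, y_old = buffer.sample(replay_k)
        x, y = torch.cat([x_new, x_old]), torch.cat([y_new, y_old])
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    for xi, yi in zip(x_new, y_new):
        buffer.add(xi.detach(), yi.detach())
    return loss.item()
```

Mixing each incoming batch with a uniform sample of past data is one simple way to let a model track a changing audio stream without discarding what it has already learned.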

Continual learning in audio systems not only enhances model generalization and reduces computational costs but also improves resilience in unpredictable environments. Despite these benefits, current research and practice in audio signal processing rarely support continual adaptation, and issues such as catastrophic forgetting remain largely unresolved. Moreover, regular conference sessions often emphasize static learning paradigms and lack dedicated space for the unique challenges of continual learning on audio data.
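
One widely studied mitigation for catastrophic forgetting is Elastic Weight Consolidation (EWC), which penalizes movement of parameters that were important for earlier tasks. The sketch below uses a simplified diagonal Fisher approximation (averaged squared gradients); it is an assumption-laden illustration, not a reference implementation.

```python
# Simplified Elastic Weight Consolidation (EWC) penalty sketch.
# The Fisher information is approximated by averaged squared gradients
# on data from the previous task; all names here are illustrative.
import torch

def estimate_fisher(model, loss_fn, loader):
    """Diagonal Fisher approximation from squared gradients."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(loader), 1) for n, f in fisher.items()}

def ewc_penalty(model, fisher, old_params, lam=100.0):
    """Quadratic penalty anchoring parameters near their old-task values."""
    penalty = 0.0
    for n, p in model.named_parameters():
        penalty = penalty + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return lam / 2.0 * penalty

# When training on a new task, the total loss would then be:
#   loss = task_loss + ewc_penalty(model, fisher, old_params)
```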

This Special Issue, entitled “Advancing Audio/Speech Machine Learning: From Static to Continual Learning”, therefore welcomes the submission of original research articles, technical reports, reviews, and mini-reviews that address topics including, but not limited to, the following:

  • Continual learning algorithms for audio and speech;
  • Adaptive audio systems for dynamic environments;
  • Real-time speech recognition and adaptation;
  • Cognitive and contextual audio processing;
  • Audio model generalization and robustness;
  • Integration of speech feedback mechanisms;
  • Cross-domain continual learning in audio applications;
  • Mitigating catastrophic forgetting in sequential audio tasks;
  • Evaluation frameworks for continual learning in audio systems;
  • Interdisciplinary approaches to adaptive audio processing.

Dr. Kele Xu
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 250 words) can be sent to the Editorial Office for assessment.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Acoustics is an international peer-reviewed open access quarterly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • audio/speech signal processing
  • speech recognition enhancement
  • continual learning algorithms
  • adaptive audio systems
  • real-time speech adaptation
  • cognitive audio processing
  • audio contextual learning
  • speech feedback integration
  • audio model generalization
  • audio system robustness

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies is available on the MDPI website.

Published Papers (3 papers)


Research

26 pages, 3165 KB  
Article
Analysis of Fundamental Frequency Changes in Astronaut Speech in Microgravity and in Terrestrial Conditions
by Natalia Repyuk, Anton Konev, Vladimir Faerman, Dmitry Rulev and Grigory Yashchenko
Acoustics 2026, 8(1), 18; https://doi.org/10.3390/acoustics8010018 - 13 Mar 2026
Viewed by 503
Abstract
This study investigates the influence of microgravity on the fundamental frequency (F0) of astronauts’ speech. A speech corpus was compiled, including recordings in microgravity and on Earth, matched by speaker and content. The signal processing methodology included filtering with consideration of human auditory perception, segmentation of speech fragments, F0 estimation using digital signal processing techniques, and visualization through fundamental frequency dynamics plots. Results revealed a consistent increase in F0 for most astronauts under microgravity, with maximum values of 450 Hz for female speakers and 245 Hz for male speakers. Elevated F0 levels were observed for approximately 86% of the total duration of speech fragments recorded in microgravity, compared with 14% on Earth. These findings confirm that microgravity affects the speech apparatus and acoustic characteristics of voice. Practical implications include adapting voice-controlled systems and automatic speech recognition for space environments, monitoring crew condition, and studying speech physiology under extreme conditions.
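
As a generic illustration of DSP-based F0 estimation of the kind described above (not the authors' pipeline), an F0 contour can be extracted with the YIN algorithm as implemented in librosa; the file name and search range below are hypothetical.

```python
# Generic F0 contour extraction with the YIN algorithm via librosa;
# an illustration of standard DSP-based F0 estimation only, not the
# processing pipeline used in the paper.
import librosa

y, sr = librosa.load("speech_fragment.wav", sr=16000)  # hypothetical file
f0 = librosa.yin(y, fmin=60, fmax=500, sr=sr, frame_length=2048)
times = librosa.times_like(f0, sr=sr, hop_length=2048 // 4)
# f0[i] is the estimated fundamental frequency (Hz) at time times[i];
# the 60-500 Hz search range covers typical male and female speech.
```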

20 pages, 5360 KB  
Article
Experimental Investigation of Deviations in Sound Reproduction
by Paul Oomen, Bashar Farran, Luka Nadiradze, Máté Csanád and Amira Val Baker
Acoustics 2026, 8(1), 7; https://doi.org/10.3390/acoustics8010007 - 28 Jan 2026
Viewed by 1656
Abstract
Sound reproduction is the electro-mechanical re-creation of sound waves using analogue and digital audio equipment. Although sound reproduction implies that repeated acoustical events are close to identical, numerous fixed and variable conditions affect the acoustic result. To arrive at a better understanding of the magnitude of deviations in sound reproduction, amplitude deviation and phase distortion of a sound signal were measured at various reproduction stages and compared under a set of controlled acoustical conditions, one condition being the presence of a human subject in the acoustic test environment. Deviations in electroacoustic reproduction were smaller than ±0.2 dB amplitude and ±3 degrees phase shift when comparing trials recorded on the same day (Δt < 8 h, mean uncertainty u = 1.58%). Deviations increased significantly, to more than two times the amplitude deviation and three times the phase shift, when comparing trials recorded on different days (Δt > 16 h, u = 4.63%). Deviations increased further, to more than 15 times the amplitude deviation and phase shift, when a human subject was present in the acoustic environment (u = 24.64%). For the first time, this study shows that the human body does not merely absorb but can also cause amplification of sound energy. The degree of attenuation or amplification per frequency shows complex variance depending on the type of reproduction and the subject, indicating a nonlinear dynamic interaction. The findings of this study may serve as a reference to update acoustical standards and improve the accuracy and reliability of sound reproduction and its application in measurements, diagnostics and therapeutic methods.
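
As a generic sketch of the kind of trial-to-trial comparison reported above (not the authors' measurement protocol), per-frequency amplitude deviation in decibels and phase shift in degrees between two recordings can be computed from their spectra.

```python
# Per-frequency amplitude deviation (dB) and phase shift (degrees)
# between two trials of a reproduced signal; a generic sketch, not
# the measurement protocol used in the paper.
import numpy as np

def compare_trials(trial_a, trial_b, sr):
    """Compare two recordings of the same signal in the frequency domain."""
    n = min(len(trial_a), len(trial_b))
    spec_a = np.fft.rfft(trial_a[:n] * np.hanning(n))
    spec_b = np.fft.rfft(trial_b[:n] * np.hanning(n))
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    eps = 1e-12  # avoid division by zero in silent bins
    amp_dev_db = 20 * np.log10((np.abs(spec_b) + eps) / (np.abs(spec_a) + eps))
    # Phase difference per bin, wrapped to (-180, 180] degrees.
    phase_deg = np.degrees(np.angle(spec_b * np.conj(spec_a)))
    return freqs, amp_dev_db, phase_deg
```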

26 pages, 712 KB  
Article
Comparing Multi-Scale and Pipeline Models for Speaker Change Detection
by Alymzhan Toleu, Gulmira Tolegen and Bagashar Zhumazhanov
Acoustics 2026, 8(1), 5; https://doi.org/10.3390/acoustics8010005 - 25 Jan 2026
Viewed by 935
Abstract
Speaker change detection (SCD) in long, multi-party meetings is essential for diarization, automatic speech recognition (ASR), and summarization, and is now often performed in the space of pre-trained speech embeddings. However, unsupervised approaches remain dominant when timely labeled audio is scarce, and their behavior under a unified modeling setup is still not well understood. In this paper, we systematically compare two representative unsupervised approaches on the multi-talker audio meeting corpus: (i) a clustering-based pipeline that segments and clusters embeddings/features and scores boundaries via cluster changes and jump magnitude, and (ii) a multi-scale jump-based detector that measures embedding discontinuities at several window lengths and fuses them via temporal clustering and voting. Using a shared front-end and protocol, we vary the underlying features (ECAPA, WavLM, wav2vec 2.0, MFCC, and log-Mel) and test the models' robustness under additive noise. The results show that embedding choice is crucial and that the two methods offer complementary trade-offs: the pipeline yields low false alarm rates but higher misses, while the multi-scale detector achieves relatively high recall at the cost of many false alarms.
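
As a rough single-scale sketch of the jump-based idea compared in this paper (the embedding matrix, window length, and threshold are illustrative placeholders; the paper's detector additionally fuses multiple scales via temporal clustering and voting), boundaries can be scored by the cosine distance between adjacent embedding windows.

```python
# Single-scale sketch of a jump-based speaker change detector: score
# each frame boundary by the cosine distance between the mean
# embeddings of the windows on either side. Window size and threshold
# are illustrative placeholders, not values from the paper.
import numpy as np

def change_scores(embeddings, win=10):
    """embeddings: (T, D) array of per-frame speaker embeddings."""
    scores = np.zeros(len(embeddings))
    for t in range(win, len(embeddings) - win):
        left = embeddings[t - win:t].mean(axis=0)
        right = embeddings[t:t + win].mean(axis=0)
        cos = left @ right / (np.linalg.norm(left) * np.linalg.norm(right) + 1e-12)
        scores[t] = 1.0 - cos  # large discontinuity => likely speaker change
    return scores

def detect_changes(scores, threshold=0.35, min_gap=10):
    """Greedy peak picking with a minimum gap between detections."""
    picked = []
    for t in np.argsort(scores)[::-1]:
        if scores[t] < threshold:
            break
        if all(abs(t - p) >= min_gap for p in picked):
            picked.append(int(t))
    return sorted(picked)
```

Raising the threshold trades recall for fewer false alarms, which mirrors the pipeline-versus-detector trade-off the abstract describes.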
