Deep Learning for Speech, Image and Language Processing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: 31 December 2024 | Viewed by 2568

Special Issue Editor


Prof. Dr. Dongsuk Yook
Guest Editor
Department of Computer Science and Engineering, Korea University, Seoul 02841, Republic of Korea
Interests: deep learning; machine learning; artificial intelligence; speech processing

Special Issue Information

Dear Colleagues,

Deep learning is an essential technology in various application areas. It has begun to show performance comparable to that of humans in audio, image, video, and natural language processing applications. Recently, spoken language translation and multimodal large language model technologies have provided new interfaces and methods for inputting, manipulating, and generating text, sound, images, and video using computers. This Special Issue is dedicated to state-of-the-art research articles, as well as tutorials and reviews, in the field of deep learning for speech, image, and language processing.

Prof. Dr. Dongsuk Yook
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • speech processing
  • image processing
  • signal processing
  • language processing
  • applications and theories of deep learning

Published Papers (4 papers)

Research

14 pages, 3407 KiB  
Article
An Audio Copy-Move Forgery Localization Model by CNN-Based Spectral Analysis
by Wei Zhao, Yujin Zhang, Yongqi Wang and Shiwen Zhang
Appl. Sci. 2024, 14(11), 4882; https://doi.org/10.3390/app14114882 - 4 Jun 2024
Viewed by 552
Abstract
In audio copy-move forgery forensics, existing traditional methods typically first segment audio into voiced and silent segments, then compute the similarity between voiced segments to detect and locate forged segments. However, audio collected in noisy environments is difficult to segment, and manually set heuristic similarity thresholds lack robustness. Existing deep learning methods extract features from audio and then use neural networks for binary classification, lacking the ability to locate forged segments. Therefore, to locate audio copy-move forgery segments, we improve upon existing deep learning methods and propose a robust localization model built on CNN-based spectral analysis. In the localization model, the Feature Extraction Module extracts deep features from Mel-spectrograms, the Correlation Detection Module automatically decides on the correlation between these deep features, and the Mask Decoding Module visually locates the forged segments. Experimental results show that, compared to existing methods, the localization model improves the detection accuracy of audio copy-move forgery by 3.0–6.8% and improves the average detection accuracy on forged audio subjected to post-processing attacks such as noise, filtering, resampling, and MP3 compression by over 7.0%.
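As an illustration only (not the authors' released code), the following minimal PyTorch sketch mirrors the three-module design described above: a CNN feature extractor over Mel-spectrograms, a frame-to-frame correlation step, and a mask decoder that outputs a per-frame forgery mask. All layer sizes and the 16 kHz sample rate are assumptions.

```python
# Hypothetical sketch of a CNN-based copy-move forgery localizer (illustrative only).
import torch
import torch.nn as nn
import torchaudio

class CopyMoveLocalizer(nn.Module):
    def __init__(self, n_mels=64, feat_dim=128):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=1024, hop_length=256, n_mels=n_mels)
        # Feature Extraction Module: CNN over the Mel-spectrogram,
        # producing one deep feature vector per time frame.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse the mel axis
        )
        self.proj = nn.Conv1d(64, feat_dim, kernel_size=1)
        # Mask Decoding Module: maps per-frame correlation evidence to a forgery mask.
        self.mask_decoder = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(16, 1, kernel_size=1), nn.Sigmoid(),
        )

    def forward(self, waveform):                        # waveform: (B, samples)
        spec = self.melspec(waveform).unsqueeze(1)      # (B, 1, mel, T)
        feats = self.proj(self.encoder(spec).squeeze(2))  # (B, D, T)
        feats = nn.functional.normalize(feats, dim=1)
        # Correlation Detection Module: frame-to-frame cosine similarity.
        corr = torch.einsum("bdt,bds->bts", feats, feats)               # (B, T, T)
        corr = corr - torch.diag_embed(torch.diagonal(corr, dim1=1, dim2=2))
        evidence = corr.max(dim=2).values.unsqueeze(1)                  # (B, 1, T)
        return self.mask_decoder(evidence).squeeze(1)   # per-frame forgery mask

mask = CopyMoveLocalizer()(torch.randn(1, 16000 * 4))   # 4 s of dummy audio
```

The intuition, consistent with the abstract, is that copied-and-pasted frames produce strong off-diagonal correlations between their deep features, which the mask decoder translates into a localization mask.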

18 pages, 3890 KiB  
Article
Pyramid Feature Attention Network for Speech Resampling Detection
by Xinyu Zhou, Yujin Zhang, Yongqi Wang, Jin Tian and Shaolun Xu
Appl. Sci. 2024, 14(11), 4803; https://doi.org/10.3390/app14114803 - 1 Jun 2024
Viewed by 314
Abstract
Speech forgery and tampering, increasingly facilitated by advanced audio editing software, pose significant threats to the integrity and privacy of digital speech. Speech resampling is a common post-processing operation in various speech-tampering methods, so the forensic detection of speech resampling is of great significance. For speech resampling detection, most previous works used traditional feature extraction and classification methods to distinguish original speech from forged speech. In view of the powerful feature extraction ability of deep learning, this paper converts the speech signal into a spectrogram with time-frequency characteristics and uses a feature pyramid network (FPN) with the Squeeze-and-Excitation (SE) attention mechanism to learn speech resampling features. The proposed method combines low-level location information and high-level semantic information, which dramatically improves the detection performance for speech resampling. Experiments were carried out on a resampling corpus built from the TIMIT dataset. The results indicate that the proposed method significantly improves the detection accuracy for various resampled speech; for tampered speech with a resampling factor of 0.9, the detection accuracy is increased by nearly 20%. In addition, a robustness test demonstrates that the proposed model is strongly resistant to MP3 compression, and its overall performance is better than that of existing methods.
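The abstract above combines a feature pyramid network with Squeeze-and-Excitation attention over spectrograms; the sketch below is a hypothetical, much-reduced version of that idea in PyTorch (a three-stage backbone with SE blocks, lateral 1x1 convolutions, and a top-down pathway feeding a binary original-vs-resampled classifier). Channel widths, strides, and the classification head are illustrative assumptions, not the paper's configuration.

```python
# Hypothetical FPN + SE classifier for resampling detection (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: re-weight channels by global context."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                      # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))        # squeeze -> (B, C)
        return x * w[:, :, None, None]         # excite

class SEFPNClassifier(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        # Bottom-up backbone: three strided conv stages (low -> high level).
        self.c1 = nn.Sequential(nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(), SEBlock(32))
        self.c2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(), SEBlock(64))
        self.c3 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(), SEBlock(128))
        # Lateral 1x1 convs project every stage to a common pyramid width.
        self.l1 = nn.Conv2d(32, 64, 1)
        self.l2 = nn.Conv2d(64, 64, 1)
        self.l3 = nn.Conv2d(128, 64, 1)
        self.head = nn.Linear(64, num_classes)

    def forward(self, spec):                   # spec: (B, 1, freq, time)
        f1 = self.c1(spec)
        f2 = self.c2(f1)
        f3 = self.c3(f2)
        # Top-down pathway: upsample high-level semantics, add low-level detail.
        p3 = self.l3(f3)
        p2 = self.l2(f2) + F.interpolate(p3, size=f2.shape[-2:], mode="nearest")
        p1 = self.l1(f1) + F.interpolate(p2, size=f1.shape[-2:], mode="nearest")
        return self.head(p1.mean(dim=(2, 3)))  # original vs. resampled logits

logits = SEFPNClassifier()(torch.randn(2, 1, 128, 300))
```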

13 pages, 324 KiB  
Article
Branch-Transformer: A Parallel Branch Architecture to Capture Local and Global Features for Language Identification
by Zeen Li, Shuanghong Liu, Zhihua Fang and Liang He
Appl. Sci. 2024, 14(11), 4681; https://doi.org/10.3390/app14114681 - 29 May 2024
Viewed by 404
Abstract
Currently, an increasing number of researchers are opting to use transformer or conformer models for language identification, achieving outstanding results. Among them, transformer models based on self-attention can only capture global information and lack finer local details. Other approaches employ conformer models, cascading convolutional neural networks and transformers to capture both local and global information. However, this static single-branch architecture is difficult to interpret and modify, and it incurs greater inference difficulty and computational cost than dual-branch models. Therefore, in this paper, we propose a novel model called the Branch-transformer (B-transformer). In contrast to traditional transformers, it consists of parallel dual-branch structures: one branch utilizes self-attention to capture global information, while the other employs a Convolutional Gated Multi-Layer Perceptron (cgMLP) module to extract local information. We also investigate various fusion methods for integrating the global and local information and experimentally validate the effectiveness of our approach on the NIST LRE 2017 dataset.
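To make the parallel dual-branch idea concrete, here is a hypothetical single block in PyTorch: a self-attention branch for global context runs alongside a convolutional gated MLP branch for local detail, and the two outputs are fused by concatenation followed by a linear projection (one of several fusion options the paper investigates). Dimensions, kernel size, and the fusion choice are assumptions for illustration.

```python
# Hypothetical dual-branch (global + local) block (illustrative only).
import torch
import torch.nn as nn

class ConvGatedMLP(nn.Module):
    """cgMLP-style local branch: pointwise expansion gated by a depthwise conv."""
    def __init__(self, d_model, expansion=4, kernel_size=15):
        super().__init__()
        d_hidden = d_model * expansion
        self.in_proj = nn.Linear(d_model, d_hidden)
        self.dw_conv = nn.Conv1d(d_hidden // 2, d_hidden // 2, kernel_size,
                                 padding=kernel_size // 2, groups=d_hidden // 2)
        self.out_proj = nn.Linear(d_hidden // 2, d_model)

    def forward(self, x):                        # x: (B, T, d_model)
        a, b = self.in_proj(x).chunk(2, dim=-1)  # split for gating
        b = self.dw_conv(b.transpose(1, 2)).transpose(1, 2)
        return self.out_proj(a * b)              # gate the linear path locally

class DualBranchBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cgmlp = ConvGatedMLP(d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, x):                        # x: (B, T, d_model)
        h = self.norm(x)
        g, _ = self.attn(h, h, h)                # global branch (self-attention)
        l = self.cgmlp(h)                        # local branch (gated convolution)
        return x + self.fuse(torch.cat([g, l], dim=-1))  # residual fusion

out = DualBranchBlock()(torch.randn(2, 100, 256))
```

Because the two branches run in parallel rather than being stacked, each can be inspected or swapped independently, which is the interpretability and flexibility argument made in the abstract.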

14 pages, 1802 KiB  
Article
Wav2wav: Wave-to-Wave Voice Conversion
by Changhyeon Jeong, Hyung-pil Chang, In-Chul Yoo and Dongsuk Yook
Appl. Sci. 2024, 14(10), 4251; https://doi.org/10.3390/app14104251 - 17 May 2024
Viewed by 967
Abstract
Voice conversion is the task of changing the speaker characteristics of input speech while preserving its linguistic content. It can be used in various areas, such as entertainment, medicine, and education. The quality of the converted speech is crucial for voice conversion algorithms to be useful in these applications. Deep learning-based voice conversion algorithms, which have recently shown promising results, generally consist of three modules: a feature extractor, a feature converter, and a vocoder. The feature extractor accepts the waveform as input and extracts speech feature vectors for further processing; these feature vectors are later synthesized back into waveforms by the vocoder. The feature converter performs the actual voice conversion, so many previous studies focused on improving this module separately and combined it with an independently trained vocoder to synthesize the final waveform. Since the feature converter and the vocoder are trained independently, the output of the converter may not be compatible with the input of the vocoder, which causes performance degradation. Furthermore, most voice conversion algorithms utilize mel-spectrogram-based speech feature vectors without modification; these feature vectors perform well in a variety of speech-processing areas but could be further optimized for voice conversion. To address these problems, we propose a novel wave-to-wave (wav2wav) voice conversion method that integrates the feature extractor, the feature converter, and the vocoder into a single module and trains the system in an end-to-end manner. We evaluated the effectiveness of the proposed method using the VCC2018 dataset.
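The following toy PyTorch sketch (not the paper's architecture) shows the end-to-end idea: a learned waveform front end, a converter conditioned on a target-speaker code, and a neural vocoder are wrapped in one module so that a single waveform-level loss trains all three jointly, avoiding the converter/vocoder mismatch described above. The specific layers, speaker encoding, and L1 loss are placeholders.

```python
# Hypothetical end-to-end wave-to-wave voice conversion skeleton (illustrative only).
import torch
import torch.nn as nn

class Wav2WavVC(nn.Module):
    def __init__(self, dim=256, n_speakers=4):
        super().__init__()
        # Feature extractor: learned front end operating directly on the waveform.
        self.extractor = nn.Conv1d(1, dim, kernel_size=400, stride=160, padding=200)
        # Feature converter: conditioned on a target-speaker one-hot code.
        self.converter = nn.GRU(dim + n_speakers, dim, batch_first=True)
        # Vocoder: maps converted features back to a waveform.
        self.vocoder = nn.ConvTranspose1d(dim, 1, kernel_size=400, stride=160, padding=120)

    def forward(self, wav, spk_onehot):          # wav: (B, 1, N); spk: (B, n_speakers)
        feats = self.extractor(wav).transpose(1, 2)              # (B, T, dim)
        spk = spk_onehot[:, None, :].expand(-1, feats.size(1), -1)
        converted, _ = self.converter(torch.cat([feats, spk], dim=-1))
        return self.vocoder(converted.transpose(1, 2))           # back to waveform

model = Wav2WavVC()
wav = torch.randn(2, 1, 16000)
out = model(wav, torch.eye(4)[:2])               # convert toward two target speakers
# A single waveform-level loss reaches every module, so no separately trained
# vocoder has to consume features it never saw during training.
loss = nn.functional.l1_loss(out[..., :wav.size(-1)], wav)
loss.backward()
```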
