Search Results (43)

Search Parameters:
Keywords = connectionist temporal classification

26 pages, 1497 KB  
Article
Lightweight End-to-End Diacritical Arabic Speech Recognition Using CTC-Transformer with Relative Positional Encoding
by Haifa Alaqel and Khalil El Hindi
Mathematics 2025, 13(20), 3352; https://doi.org/10.3390/math13203352 - 21 Oct 2025
Viewed by 175
Abstract
Arabic automatic speech recognition (ASR) faces distinct challenges due to its complex morphology, dialectal variations, and the presence of diacritical marks that strongly influence pronunciation and meaning. This study introduces a lightweight approach for diacritical Arabic ASR that employs a Transformer encoder architecture enhanced with Relative Positional Encoding (RPE) and Connectionist Temporal Classification (CTC) loss, eliminating the need for a conventional decoder. A two-stage training process was applied: initial pretraining on Modern Standard Arabic (MSA), followed by progressive three-phase fine-tuning on diacritical Arabic datasets. The proposed model achieves a WER of 22.01% on the SASSC dataset, improving over traditional systems (best 28.4% WER) while using only ≈14 M parameters. In comparison, XLSR-Large (300 M parameters) achieves a WER of 12.17% but requires over 20× more parameters and substantially higher training and inference costs. Although XLSR attains lower error rates, the proposed model is far more practical for resource-constrained environments, offering reduced complexity, faster training, and lower memory usage while maintaining competitive accuracy. These results show that encoder-only Transformers with RPE, combined with CTC training and systematic architectural optimization, can effectively model Arabic phonetic structure while maintaining computational efficiency. This work establishes a new benchmark for resource-efficient diacritical Arabic ASR, making the technology more accessible for real-world deployment. Full article
(This article belongs to the Section E1: Mathematics and Computer Science)
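As a rough illustration of the encoder-only design described above, the sketch below wires a small Transformer encoder to a CTC loss in PyTorch. It is not the authors' model: relative positional encoding is replaced by a plain learned absolute encoding for brevity, and the feature dimensions, vocabulary size, and input shapes are invented for the example.

```python
# Minimal sketch (not the paper's code): encoder-only Transformer ASR trained with CTC.
import torch
import torch.nn as nn

class CTCTransformerASR(nn.Module):
    def __init__(self, n_mels=80, d_model=256, n_heads=4, n_layers=6, vocab_size=64):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        self.pos = nn.Embedding(4000, d_model)        # stand-in for relative PE, max 4000 frames assumed
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=1024,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)    # index 0 reserved for the CTC blank

    def forward(self, feats):                         # feats: (B, T, n_mels)
        t = torch.arange(feats.size(1), device=feats.device)
        x = self.proj(feats) + self.pos(t)
        x = self.encoder(x)
        return self.head(x).log_softmax(-1)           # (B, T, vocab)

model = CTCTransformerASR()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
feats = torch.randn(2, 300, 80)                       # two fake 3 s utterances of log-mel frames
targets = torch.randint(1, 64, (2, 40))               # dummy diacritised-character ids
log_probs = model(feats).transpose(0, 1)              # CTCLoss expects (T, B, C)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 300, dtype=torch.long),
           target_lengths=torch.full((2,), 40, dtype=torch.long))
loss.backward()
```

Dropping the decoder is what keeps the parameter count small: the CTC head handles frame-to-character alignment during training, so no autoregressive decoding stack is needed.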

18 pages, 2065 KB  
Article
Phoneme-Aware Augmentation for Robust Cantonese ASR Under Low-Resource Conditions
by Lusheng Zhang, Shie Wu and Zhongxun Wang
Symmetry 2025, 17(9), 1478; https://doi.org/10.3390/sym17091478 - 8 Sep 2025
Viewed by 731
Abstract
Cantonese automatic speech recognition (ASR) faces persistent challenges due to its nine lexical tones, extensive phonological variation, and the scarcity of professionally transcribed corpora. To address these issues, we propose a lightweight and data-efficient framework that leverages weak phonetic supervision (WPS) in conjunction with two phoneme-aware augmentation strategies. (1) Dynamic Boundary-Aligned Phoneme Dropout progressively removes entire IPA segments according to a curriculum schedule, simulating real-world phenomena such as elision, lenition, and tonal drift while ensuring training stability. (2) Phoneme-Aware SpecAugment confines all time- and frequency-masking operations within phoneme boundaries and prioritizes high-attention regions, thereby preserving intra-phonemic contours and formant integrity. Built on the Whistle encoder, which integrates a Conformer backbone, Connectionist Temporal Classification–Conditional Random Field (CTC-CRF) alignment, and a multi-lingual phonetic space, the approach requires only a grapheme-to-phoneme lexicon and Montreal Forced Aligner outputs, without any additional manual labeling. Experiments on the Cantonese subset of Common Voice demonstrate consistent gains: Dynamic Dropout alone reduces the phoneme error rate (PER) from 17.8% to 16.7% with 50 h of speech and from 16.4% to 15.1% with 100 h, while the combination of the two augmentations further lowers the PER to 15.9%/14.4%. These results confirm that structure-aware phoneme-level perturbations provide an effective and low-cost solution for building robust Cantonese ASR systems under low-resource conditions. Full article
(This article belongs to the Section Computer)
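The phoneme-aware masking idea above can be pictured in a few lines: a SpecAugment-style time mask restricted to a single aligned phoneme segment. This is only a sketch under assumed MFA-style (start_frame, end_frame) boundaries; the attention-guided region selection and the dropout curriculum from the paper are not reproduced.

```python
# Illustrative only: a time mask that never crosses the chosen phoneme boundary.
import random
import torch

def phoneme_bounded_time_mask(spec, boundaries, max_mask=10):
    """spec: (n_mels, T) log-mel spectrogram; boundaries: list of (start, end) frame pairs."""
    start, end = random.choice(boundaries)             # pick one phoneme segment
    seg_len = end - start
    if seg_len <= 1:
        return spec
    width = random.randint(1, min(max_mask, seg_len))  # mask stays inside the segment
    t0 = random.randint(start, end - width)
    spec = spec.clone()
    spec[:, t0:t0 + width] = spec.mean()               # fill with the global mean
    return spec

spec = torch.randn(80, 200)                            # fake 2 s utterance
boundaries = [(0, 25), (25, 60), (60, 120), (120, 200)]  # fake phoneme alignment
augmented = phoneme_bounded_time_mask(spec, boundaries)
```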

102 pages, 17708 KB  
Review
From Detection to Understanding: A Systematic Survey of Deep Learning for Scene Text Processing
by Zhandong Liu, Ruixia Song, Ke Li and Yong Li
Appl. Sci. 2025, 15(17), 9247; https://doi.org/10.3390/app15179247 - 22 Aug 2025
Viewed by 1946
Abstract
Scene text understanding, serving as a cornerstone technology for autonomous navigation, document digitization, and accessibility tools, has witnessed a paradigm shift from traditional methods relying on handcrafted features and multi-stage processing pipelines to contemporary deep learning frameworks capable of learning hierarchical representations directly from raw image inputs. This survey distinctly categorizes modern scene text recognition (STR) methodologies into three principal paradigms: two-stage detection frameworks that employ region proposal networks for precise text localization, single-stage detectors designed to optimize computational efficiency, and specialized architectures tailored to handle arbitrarily shaped text through geometric-aware modeling techniques. Concurrently, an in-depth analysis of text recognition paradigms elucidates the evolutionary trajectory from connectionist temporal classification (CTC) and sequence-to-sequence models to transformer-based architectures, which excel in contextual modeling and demonstrate superior performance. In contrast to prior surveys, this work uniquely emphasizes several key differences and contributions. Firstly, it provides a comprehensive and systematic taxonomy of STR methods, explicitly highlighting the trade-offs between detection accuracy, computational efficiency, and geometric adaptability across different paradigms. Secondly, it delves into the nuances of text recognition, illustrating how transformer-based models have revolutionized the field by capturing long-range dependencies and contextual information, thereby addressing challenges in recognizing complex text layouts and multilingual scripts. Furthermore, the survey pioneers the exploration of critical research frontiers, such as multilingual text adaptation, enhancing model robustness against environmental variations (e.g., lighting conditions, occlusions), and devising data-efficient learning strategies to mitigate the dependency on large-scale annotated datasets. By synthesizing insights from technical advancements across 28 benchmark datasets and standardized evaluation protocols, this study offers researchers a holistic perspective on the current state-of-the-art, persistent challenges, and promising avenues for future research, with the ultimate goal of achieving human-level scene text comprehension. Full article

32 pages, 9129 KB  
Article
Detection and Recognition of Bilingual Urdu and English Text in Natural Scene Images Using a Convolutional Neural Network–Recurrent Neural Network Combination with a Connectionist Temporal Classification Decoder
by Khadija Tul Kubra, Muhammad Umair, Muhammad Zubair, Muhammad Tahir Naseem and Chan-Su Lee
Sensors 2025, 25(16), 5133; https://doi.org/10.3390/s25165133 - 19 Aug 2025
Viewed by 916
Abstract
Urdu and English are widely used for visual text communications worldwide in public spaces such as signboards and navigation boards. Text in such natural scenes contains useful information for modern-era applications such as language translation for foreign visitors, robot navigation, and autonomous vehicles, highlighting the importance of extracting these texts. Previous studies focused on Urdu alone or printed text pasted manually on images and lacked sufficiently large datasets for effective model training. Herein, a pipeline for Urdu and English (bilingual) text detection and recognition in complex natural scene images is proposed. Additionally, a unilingual dataset is converted into a bilingual dataset and augmented using various techniques. For implementations, a customized convolutional neural network is used for feature extraction, a recurrent neural network (RNN) is used for feature learning, and connectionist temporal classification (CTC) is employed for text recognition. Experiments are conducted using different RNNs and hidden units, which yield satisfactory results. Ablation studies are performed on the two best models by eliminating model components. The proposed pipeline is also compared to existing text detection and recognition methods. The proposed models achieved average accuracies of 98.5% for Urdu character recognition, 97.2% for Urdu word recognition, and 99.2% for English character recognition. Full article
(This article belongs to the Section Sensor Networks)
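To make the CNN-RNN-CTC pipeline above concrete, here is a generic CRNN-style sketch with greedy CTC decoding. The layer sizes, the 96-symbol bilingual vocabulary, and the input image shape are illustrative assumptions rather than the configuration reported in the paper.

```python
# Generic CRNN sketch: CNN features -> BiLSTM -> CTC outputs, plus greedy decoding.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_classes=96, img_h=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # H/2, W/2
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # H/4, W/4
        )
        self.rnn = nn.LSTM(128 * (img_h // 4), 256, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, n_classes)            # class 0 = CTC blank

    def forward(self, images):                         # images: (B, 1, H, W)
        f = self.cnn(images)                           # (B, C, H/4, W/4)
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h) # one feature vector per image column
        out, _ = self.rnn(f)
        return self.fc(out).log_softmax(-1)            # (B, W/4, n_classes)

def greedy_ctc_decode(log_probs):
    """Collapse repeats and drop blanks (index 0) from the argmax path."""
    decoded = []
    for seq in log_probs.argmax(-1):
        prev, out = -1, []
        for i in seq.tolist():
            if i != prev and i != 0:
                out.append(i)
            prev = i
        decoded.append(out)
    return decoded

model = CRNN()
print(greedy_ctc_decode(model(torch.randn(2, 1, 32, 128))))
```

The essential step is reshaping the CNN feature map so that each image column becomes one time step for the recurrent layer and the CTC output head.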

17 pages, 439 KB  
Article
MultiAVSR: Robust Speech Recognition via Supervised Multi-Task Audio–Visual Learning
by Shad Torrie, Kimi Wright and Dah-Jye Lee
Electronics 2025, 14(12), 2310; https://doi.org/10.3390/electronics14122310 - 6 Jun 2025
Viewed by 2036
Abstract
Speech recognition approaches typically fall into three categories: audio, visual, and audio–visual. Visual speech recognition, or lip reading, is the most difficult because visual cues are ambiguous and data is scarce. To address these challenges, we present a new multi-task audio–visual speech recognition, or MultiAVSR, framework for training a model on all three types of speech recognition simultaneously, primarily to improve visual speech recognition. Unlike prior works, which use separate models or complex semi-supervision, our framework employs a supervised multi-task hybrid Connectionist Temporal Classification/Attention loss, cutting training exaFLOPs to just 18% of that required by semi-supervised multi-task models. MultiAVSR achieves a state-of-the-art visual speech recognition word error rate of 21.0% on the LRS3-TED dataset. Furthermore, it exhibits robust generalization capabilities, achieving a remarkable 44.7% word error rate on the WildVSR dataset. Our framework also demonstrates reduced dependency on external language models, which is critical for real-time visual speech recognition. For the audio and audio–visual tasks, our framework improves robustness under various noisy environments, with average relative word error rate improvements of 16% and 31%, respectively. These improvements across the three tasks illustrate the robust results our supervised multi-task speech recognition framework enables. Full article
(This article belongs to the Special Issue Advances in Information, Intelligence, Systems and Applications)
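At its core, the hybrid Connectionist Temporal Classification/Attention objective mentioned above is a weighted sum of a CTC loss on encoder outputs and a cross-entropy loss on decoder outputs. The sketch below shows that combination on dummy tensors; the 0.3 weight and all shapes are placeholders, not values from the paper.

```python
# Hedged sketch of a hybrid CTC/Attention training objective on dummy tensors.
import torch
import torch.nn as nn

ctc_weight = 0.3
ctc_loss_fn = nn.CTCLoss(blank=0, zero_infinity=True)
ce_loss_fn = nn.CrossEntropyLoss(ignore_index=-100)        # -100 would mark padded targets

B, T, U, V = 2, 150, 20, 500                                # batch, frames, target length, vocab
encoder_log_probs = torch.randn(T, B, V).log_softmax(-1)    # from the shared encoder
decoder_logits = torch.randn(B, U, V)                       # from the attention decoder
targets = torch.randint(1, V, (B, U))

loss_ctc = ctc_loss_fn(encoder_log_probs, targets,
                       input_lengths=torch.full((B,), T, dtype=torch.long),
                       target_lengths=torch.full((B,), U, dtype=torch.long))
loss_att = ce_loss_fn(decoder_logits.reshape(-1, V), targets.reshape(-1))
loss = ctc_weight * loss_ctc + (1 - ctc_weight) * loss_att  # single supervised objective
```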

15 pages, 1909 KB  
Article
Helium Speech Recognition Method Based on Spectrogram with Deep Learning
by Yonghong Chen, Shibing Zhang and Dongmei Li
Big Data Cogn. Comput. 2025, 9(5), 136; https://doi.org/10.3390/bdcc9050136 - 20 May 2025
Cited by 1 | Viewed by 708
Abstract
With the development of the marine economy and the increase in marine activities, deep saturation diving has gained significant attention. Helium speech communication is indispensable for saturation diving operations and is a critical technology for deep saturation diving, serving as the sole communication method to ensure the smooth execution of such operations. This study introduces deep learning into helium speech recognition and proposes a spectrogram-based dual-model helium speech recognition method. First, we extract the spectrogram features from the helium speech. Then, we combine a deep fully convolutional neural network with connectionist temporal classification (CTC) to form an acoustic model, in which the spectrogram features of helium speech are used as an input to convert speech signals into phonetic sequences. Finally, a maximum entropy hidden Markov model (MEMM) is employed as the language model to convert the phonetic sequences to word outputs, which is regarded as a dynamic programming problem. We use a Viterbi algorithm to find the optimal path to decode the phonetic sequences to word sequences. The simulation results show that the method can effectively recognize helium speech with a recognition rate of 97.89% for isolated words and 95.99% for continuous helium speech. Full article
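The decoding step described above treats phoneme-to-word conversion as a dynamic programming problem solved with the Viterbi algorithm. Below is a small, generic Viterbi implementation over toy probabilities; it is not the paper's MEMM language model, only the path search it relies on.

```python
# Generic Viterbi decoding: recover the best state path from per-step scores.
import numpy as np

def viterbi(log_emissions, log_trans, log_start):
    """log_emissions: (T, S); log_trans: (S, S); log_start: (S,) -> best path (list of states)."""
    T, S = log_emissions.shape
    score = log_start + log_emissions[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans          # score of every prev -> curr transition
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emissions[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)                     # toy 6-step, 4-state example
print(viterbi(np.log(rng.dirichlet(np.ones(4), size=6)),
              np.log(rng.dirichlet(np.ones(4), size=4)),
              np.log(np.ones(4) / 4)))
```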

31 pages, 6120 KB  
Article
Enhancing Security of Online Interfaces: Adversarial Handwritten Arabic CAPTCHA Generation
by Ghady Alrasheed and Suliman A. Alsuhibany
Appl. Sci. 2025, 15(6), 2972; https://doi.org/10.3390/app15062972 - 10 Mar 2025
Viewed by 1477
Abstract
With the increasing online activity of Arabic speakers, the development of effective CAPTCHAs (Completely Automated Public Turing Tests to Tell Computers and Humans Apart) tailored for Arabic users has become crucial. Traditional CAPTCHAs, however, are increasingly vulnerable to machine learning-based attacks. To address this challenge, we introduce a method for generating adversarial handwritten Arabic CAPTCHAs that remain user-friendly yet difficult for machines to solve. Our approach involves synthesizing handwritten Arabic words using a simulation technique, followed by the application of five adversarial perturbation techniques: Expectation Over Transformation (EOT), Scaled Gaussian Translation with Channel Shifts (SGTCS), Jacobian-based Saliency Map Attack (JSMA), Immutable Adversarial Noise (IAN), and Connectionist Temporal Classification (CTC). Evaluation results demonstrate that JSMA provides the highest level of security, with 30% of meaningless-word CAPTCHAs remaining completely unrecognized by automated systems, a figure that falls to 6.66% for meaningful words. From a usability perspective, JSMA also achieves the highest accuracy rates, with 75.6% for meaningless words and 90.6% for meaningful words. Our work presents an effective strategy for enhancing the security of Arabic websites and online interfaces against bot attacks, contributing to the advancement of CAPTCHA systems. Full article
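As a loose, much-simplified illustration of saliency-guided perturbation in the spirit of JSMA, the snippet below nudges the pixels whose gradients most influence the true class of a dummy classifier. The tiny CNN, the 30-class output, the pixel budget, and the step size are all invented; none of the five techniques from the paper is implemented faithfully here.

```python
# Simplified saliency-based perturbation on a dummy image classifier (not the paper's attack).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 30))
image = torch.rand(1, 1, 64, 256, requires_grad=True)   # fake handwritten-word CAPTCHA
true_class = 7                                           # fake label

score = model(image)[0, true_class]                      # scalar score of the correct class
score.backward()
saliency = image.grad.abs().flatten()
top = saliency.topk(50).indices                          # the 50 most influential pixels
adv = image.detach().clone().flatten()
adv[top] -= 0.3 * image.grad.flatten()[top].sign()       # push them against the true class
adv = adv.reshape(image.shape).clamp(0, 1)               # keep a valid image
```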

18 pages, 585 KB  
Article
Improving Diacritical Arabic Speech Recognition: Transformer-Based Models with Transfer Learning and Hybrid Data Augmentation
by Haifa Alaqel and Khalil El Hindi
Information 2025, 16(3), 161; https://doi.org/10.3390/info16030161 - 20 Feb 2025
Cited by 2 | Viewed by 2678
Abstract
Diacritical Arabic (DA) refers to Arabic text with diacritical marks that guide pronunciation and clarify meanings, making their recognition crucial for accurate linguistic interpretation. These diacritical marks (short vowels) significantly influence meaning and pronunciation, and their accurate recognition is vital for the effectiveness of automatic speech recognition (ASR) systems, particularly in applications requiring high semantic precision, such as voice-enabled translation services. Despite its importance, leveraging advanced machine learning techniques to enhance ASR for diacritical Arabic has remained underexplored. A key challenge in developing DA ASR is the limited availability of training data. This study introduces a transformer-based approach leveraging transfer learning and data augmentation to address these challenges. Using a cross-lingual speech representation (XLSR) model pretrained on 53 languages, we fine-tune it on DA and integrate connectionist temporal classification (CTC) with transformers for improved performance. Data augmentation techniques, including volume adjustment, pitch shift, speed alteration, and hybrid strategies, further mitigate data limitations, significantly reducing word error rates (WER). Our methods achieve a WER of 12.17%, outperforming traditional ASR systems and setting a new benchmark for DA ASR. These findings demonstrate the potential of advanced machine learning to address longstanding challenges in DA ASR and enhance its accuracy. Full article
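Two of the augmentations listed above, volume adjustment and speed alteration, can be sketched without an audio library; pitch shifting is omitted to keep the example dependency-free. The gains and speed factors are arbitrary examples, not the settings used in the study.

```python
# Hedged illustration of simple waveform augmentations (volume and speed only).
import torch
import torch.nn.functional as F

def adjust_volume(wave, gain_db):
    return wave * (10.0 ** (gain_db / 20.0))

def change_speed(wave, factor):
    """Resample by linear interpolation so playback at the original rate is `factor`x faster."""
    new_len = int(wave.shape[-1] / factor)
    return F.interpolate(wave[None, None, :], size=new_len,
                         mode="linear", align_corners=False)[0, 0]

wave = torch.randn(16000)                                  # 1 s of fake 16 kHz audio
louder = adjust_volume(wave, gain_db=3.0)
faster = change_speed(wave, factor=1.1)
augmented = change_speed(adjust_volume(wave, -3.0), 0.9)   # a simple hybrid chain
```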

17 pages, 3114 KB  
Article
Real-Time Communication Aid System for Korean Dysarthric Speech
by Kwanghyun Park and Jungpyo Hong
Appl. Sci. 2025, 15(3), 1416; https://doi.org/10.3390/app15031416 - 30 Jan 2025
Cited by 1 | Viewed by 2104
Abstract
Dysarthria is a speech disorder characterized by difficulties in articulation and vocalization due to impaired control of the articulatory system. Around 30% of individuals with speech disorders have dysarthria, facing significant communication challenges. Existing assistive tools for dysarthria either require additional manipulation or only provide word-level speech support, limiting their ability to support effective communication in real-world situations. Thus, this paper proposes a real-time communication aid system that converts sentence-level Korean dysarthric speech to non-dysarthric normal speech. The proposed system consists of two main parts in cascading form. Specifically, a Korean Automatic Speech Recognition (ASR) model is trained with dysarthric utterances using a conformer-based architecture and the graph transducer network–connectionist temporal classification algorithm, significantly enhancing recognition performance over previous models. Subsequently, a Korean Text-To-Speech (TTS) model based on Jointly Training FastSpeech2 and HiFi-GAN for end-to-end Text-to-Speech (JETS) is pipelined to synthesize high-quality non-dysarthric normal speech. These models are integrated into a single system on an app server, which receives 5–10 s of dysarthric speech and converts it to normal speech after 2–3 s. This can provide a practical communication aid for people with dysarthria. Full article
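The cascaded design above reduces to a two-stage pipeline in which an ASR stage feeds its transcript to a TTS stage. The toy sketch below shows only the plumbing: asr_model and tts_model are dummy callables standing in for the paper's Conformer ASR and JETS TTS.

```python
# Two-stage cascade sketch with placeholder models.
from typing import Callable, Tuple
import numpy as np

def cascade(audio: np.ndarray,
            asr_model: Callable[[np.ndarray], str],
            tts_model: Callable[[str], np.ndarray]) -> Tuple[str, np.ndarray]:
    text = asr_model(audio)        # dysarthric speech -> recognized sentence
    speech = tts_model(text)       # recognized sentence -> synthesized non-dysarthric speech
    return text, speech

dummy_asr = lambda a: "recognized sentence"
dummy_tts = lambda t: np.zeros(16000, dtype=np.float32)    # 1 s of silence as a stand-in
text, speech = cascade(np.zeros(80000, dtype=np.float32), dummy_asr, dummy_tts)
```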

16 pages, 1512 KB  
Article
An End-To-End Speech Recognition Model for the North Shaanxi Dialect: Design and Evaluation
by Yi Qin and Feifan Yu
Sensors 2025, 25(2), 341; https://doi.org/10.3390/s25020341 - 9 Jan 2025
Cited by 1 | Viewed by 1134
Abstract
The coal mining industry in Northern Shaanxi is robust, with a prevalent use of the local dialect, known as “Shapu”, characterized by a distinct Northern Shaanxi accent. This study addresses the practical need for speech recognition in this dialect. We propose an end-to-end speech recognition model for the North Shaanxi dialect, leveraging the Conformer architecture. To tailor the model to the coal mining context, we developed a specialized corpus reflecting the phonetic characteristics of the dialect and its usage in the industry. We investigated feature extraction techniques suitable for the North Shaanxi dialect, focusing on the unique pronunciation of initial consonants and vowels. A preprocessing module was designed to accommodate the dialect’s rapid speech tempo and polyphonic nature, enhancing recognition performance. To enhance the decoder’s text generation capability, we replaced the Conformer decoder with a Transformer architecture. Additionally, to mitigate the computational demands of the model, we incorporated Connectionist Temporal Classification (CTC) joint training for optimization. The experimental results on our self-established voice dataset for the Northern Shaanxi coal mining industry demonstrate that the proposed Conformer–Transformer–CTC model achieves a 9.2% and 10.3% reduction in the word error rate compared to the standalone Conformer and Transformer models, respectively, confirming the advancement of our method. The next step will involve researching how to improve the performance of dialect speech recognition by integrating external language models and extracting pronunciation features of different dialects, thereby achieving better recognition results. Full article
(This article belongs to the Section Intelligent Sensors)

26 pages, 3823 KB  
Article
Enhanced Conformer-Based Speech Recognition via Model Fusion and Adaptive Decoding with Dynamic Rescoring
by Junhao Geng, Dongyao Jia, Zihao He, Nengkai Wu and Ziqi Li
Appl. Sci. 2024, 14(24), 11583; https://doi.org/10.3390/app142411583 - 11 Dec 2024
Cited by 1 | Viewed by 3032
Abstract
Speech recognition is widely applied in fields like security, education, and healthcare. While its development drives global information infrastructure and AI strategies, current models still face challenges such as overfitting, local optima, and inefficiencies in decoding accuracy and computational cost. These issues cause instability and long response times, hindering AI’s competitiveness. Therefore, addressing these technical bottlenecks is critical for advancing national scientific progress and global information infrastructure. In this paper, we propose improvements to the model structure fusion and decoding algorithms. First, based on the Conformer network and its variants, we introduce a weighted fusion method using training loss as an indicator, adjusting the weights, thresholds, and other related parameters of the fused models to balance the contributions of different model structures, thereby creating a more robust and generalized model that alleviates overfitting and local optima. Second, for the decoding phase, we design a dynamic adaptive decoding method that combines traditional decoding algorithms such as connectionist temporal classification and attention-based models. This ensemble approach enables the system to adapt to different acoustic environments, improving its robustness and overall performance. Additionally, to further optimize the decoding process, we introduce a penalty function mechanism as a regularization technique to reduce the model’s dependence on a single decoding approach. The penalty function limits the weights of decoding strategies to prevent over-reliance on any single decoder, thus enhancing the model’s generalization. Finally, we validate our model on the Librispeech dataset, a large-scale English speech corpus containing approximately 1000 h of audio data. Experimental results demonstrate that the proposed method achieves word error rates (WERs) of 3.92% and 4.07% on the development and test sets, respectively, significantly improving over single-model and traditional decoding methods. Notably, the method reduces WER by approximately 0.4% on complex datasets compared to several advanced mainstream models, underscoring its superior robustness and adaptability in challenging acoustic environments. The effectiveness of the proposed method in addressing overfitting and improving accuracy and efficiency during the decoding phase was validated, highlighting its significance in advancing speech recognition technology. Full article
(This article belongs to the Special Issue Deep Learning for Speech, Image and Language Processing)
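The loss-weighted fusion described above can be sketched by giving each fused model a weight that decreases with its training loss and averaging their output distributions. The softmax-over-negative-loss weighting and all numbers below are assumptions made for illustration; the abstract does not specify the exact weighting or thresholding scheme.

```python
# Illustrative loss-weighted fusion of two models' output distributions.
import torch

def fuse_outputs(logits_list, train_losses, temperature=1.0):
    """Weight each model inversely to its training loss (softmax over the negated losses)."""
    w = torch.softmax(-torch.tensor(train_losses) / temperature, dim=0)
    probs = torch.stack([l.softmax(-1) for l in logits_list])   # (n_models, ..., vocab)
    weights = w.view(-1, *([1] * (probs.dim() - 1)))            # broadcast over output dims
    return (weights * probs).sum(0)

logits_a = torch.randn(2, 50, 100)     # fused model A: (batch, frames, vocab), dummy values
logits_b = torch.randn(2, 50, 100)     # fused model B
fused = fuse_outputs([logits_a, logits_b], train_losses=[0.42, 0.57])
```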

20 pages, 1150 KB  
Article
MPSA-Conformer-CTC/Attention: A High-Accuracy, Low-Complexity End-to-End Approach for Tibetan Speech Recognition
by Changlin Wu, Huihui Sun, Kaifeng Huang and Long Wu
Sensors 2024, 24(21), 6824; https://doi.org/10.3390/s24216824 - 24 Oct 2024
Cited by 1 | Viewed by 2491
Abstract
This study addresses the challenges of low accuracy and high computational demands in Tibetan speech recognition by investigating the application of end-to-end networks. We propose a decoding strategy that integrates Connectionist Temporal Classification (CTC) and Attention mechanisms, capitalizing on the benefits of automatic alignment and attention weight extraction. The Conformer architecture is utilized as the encoder, leading to the development of the Conformer-CTC/Attention model. This model first extracts global features from the speech signal using the Conformer, followed by joint decoding of these features through CTC and Attention mechanisms. To mitigate convergence issues during training, particularly with longer input feature sequences, we introduce a Probabilistic Sparse Attention mechanism within the joint CTC/Attention framework. Additionally, we implement a maximum entropy optimization algorithm for CTC, effectively addressing challenges such as increased path counts, spike distributions, and local optima during training. We designate the proposed method as the MaxEnt-Optimized Probabilistic Sparse Attention Conformer-CTC/Attention Model (MPSA-Conformer-CTC/Attention). Experimental results indicate that our improved model achieves a word error rate reduction of 10.68% and 9.57% on self-constructed and open-source Tibetan datasets, respectively, compared to the baseline model. Furthermore, the enhanced model not only reduces memory consumption and training time but also improves generalization capability and accuracy. Full article
(This article belongs to the Special Issue New Trends in Biometric Sensing and Information Processing)

18 pages, 4420 KB  
Article
Machine Learning Approach for Arabic Handwritten Recognition
by A. M. Mutawa, Mohammad Y. Allaho and Monirah Al-Hajeri
Appl. Sci. 2024, 14(19), 9020; https://doi.org/10.3390/app14199020 - 6 Oct 2024
Cited by 3 | Viewed by 5067
Abstract
Text recognition is an important area of the pattern recognition field. Natural language processing (NLP) and pattern recognition have been utilized efficiently in script recognition. Much research has been conducted on handwritten script recognition. However, research on Arabic handwritten text recognition has received little attention compared with other languages. Therefore, it is crucial to develop a new model that can recognize Arabic handwritten text. Most of the existing models used to recognize Arabic text are based on traditional machine learning techniques. Accordingly, we implemented a new model using deep machine learning techniques by integrating two deep neural networks. In the new model, the architecture of the Residual Network (ResNet) model is used to extract features from raw images. Then, the Bidirectional Long Short-Term Memory (BiLSTM) and connectionist temporal classification (CTC) are used for sequence modeling. Our system improved the recognition rate of Arabic handwritten text compared to other models of a similar type, achieving a character error rate of 13.2% and a word error rate of 27.31%. In conclusion, the domain of Arabic handwritten recognition is advancing swiftly with the use of sophisticated deep learning methods. Full article
(This article belongs to the Special Issue Applied Intelligence in Natural Language Processing)
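The character and word error rates reported above are edit distances normalized by reference length. A generic scorer looks like the following; it is a standard Levenshtein implementation, not the authors' evaluation code.

```python
# Standard Levenshtein-based CER/WER computation.
def edit_distance(ref, hyp):
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)] for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                 # deletion
                          d[i][j - 1] + 1,                 # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))  # substitution
    return d[-1][-1]

def cer(ref: str, hyp: str) -> float:
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def wer(ref: str, hyp: str) -> float:
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)

print(wer("the cat sat", "the cat sat down"), cer("kitten", "sitting"))
```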

10 pages, 585 KB  
Technical Note
Text-Independent Phone-to-Audio Alignment Leveraging SSL (TIPAA-SSL) Pre-Trained Model Latent Representation and Knowledge Transfer
by Noé Tits, Prernna Bhatnagar and Thierry Dutoit
Acoustics 2024, 6(3), 772-781; https://doi.org/10.3390/acoustics6030042 - 29 Aug 2024
Cited by 1 | Viewed by 2242
Abstract
In this paper, we present a novel approach for text-independent phone-to-audio alignment based on phoneme recognition, representation learning and knowledge transfer. Our method leverages a self-supervised model (Wav2Vec2) fine-tuned for phoneme recognition using a Connectionist Temporal Classification (CTC) loss, a dimension reduction model and a frame-level phoneme classifier trained using forced-alignment labels (using Montreal Forced Aligner) to produce multi-lingual phonetic representations, thus requiring minimal additional training. We evaluate our model using synthetic native data from the TIMIT dataset and the SCRIBE dataset for American and British English, respectively. Our proposed model outperforms the state-of-the-art (charsiu) in statistical metrics and has applications in language learning and speech processing systems. We leave experiments on other languages for future work but the design of the system makes it easily adaptable to other languages. Full article
(This article belongs to the Special Issue Developments in Acoustic Phonetic Research)
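One step implied by the frame-level phoneme classifier above is turning per-frame predictions into time-aligned phone segments. The sketch below simply groups consecutive identical frame labels; the 20 ms hop and the fake 40-phoneme posteriors are illustrative assumptions.

```python
# Group consecutive identical frame labels into (phone_id, start_s, end_s) segments.
import torch

def frames_to_segments(frame_labels, hop_s=0.02):
    """frame_labels: (T,) int tensor -> list of (phone_id, start_s, end_s)."""
    segments, start = [], 0
    labels = frame_labels.tolist()
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((labels[start], start * hop_s, t * hop_s))
            start = t
    return segments

posteriors = torch.randn(100, 40).softmax(-1)      # 2 s of fake 40-phoneme posteriors
print(frames_to_segments(posteriors.argmax(-1)))
```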

13 pages, 2651 KB  
Article
Speech Recognition for Air Traffic Control Utilizing a Multi-Head State-Space Model and Transfer Learning
by Haijun Liang, Hanwen Chang and Jianguo Kong
Aerospace 2024, 11(5), 390; https://doi.org/10.3390/aerospace11050390 - 14 May 2024
Cited by 1 | Viewed by 2300
Abstract
In the present study, a novel end-to-end automatic speech recognition (ASR) framework, namely, ResNeXt-Mssm-CTC, has been developed for air traffic control (ATC) systems. This framework is built upon the Multi-Head State-Space Model (Mssm) and incorporates transfer learning techniques. Residual Networks with Cardinality (ResNeXt) employ multi-layered convolutions with residual connections to augment the extraction of intricate feature representations from speech signals. The Mssm is endowed with specialized gating mechanisms, which incorporate parallel heads that acquire knowledge of both local and global temporal dynamics in sequence data. Connectionist temporal classification (CTC) is utilized in the context of sequence labeling, eliminating the requirement for forced alignment and accommodating labels of varying lengths. Moreover, the utilization of transfer learning has been shown to improve performance on the target task by leveraging knowledge acquired from a source task. The experimental results indicate that the model proposed in this study exhibits superior performance compared to other baseline models. Specifically, when pretrained on the Aishell corpus, the model achieves a minimum character error rate (CER) of 7.2% and 8.3%. Furthermore, when applied to the ATC corpus, the CER is reduced to 5.5% and 6.7%. Full article
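The transfer-learning step above (pretrain on Aishell, then adapt to ATC speech) commonly amounts to loading source-task weights and fine-tuning with part of the network frozen. The sketch below shows that generic pattern on a toy model; the module names, checkpoint path, and freezing choice are hypothetical rather than details from the paper.

```python
# Generic fine-tuning pattern: load pretrained weights, freeze the frontend, train the rest.
import torch
import torch.nn as nn

class TinyASR(nn.Module):
    def __init__(self, vocab=100):
        super().__init__()
        self.frontend = nn.Sequential(nn.Conv1d(80, 256, 3, padding=1), nn.ReLU())
        self.head = nn.Linear(256, vocab)

    def forward(self, x):                          # x: (B, 80, T) filterbank features
        return self.head(self.frontend(x).transpose(1, 2)).log_softmax(-1)

model = TinyASR()
# state = torch.load("aishell_pretrained.pt"); model.load_state_dict(state)  # hypothetical checkpoint
for p in model.frontend.parameters():              # freeze low-level features at first
    p.requires_grad = False
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
```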
