Search Results (12)

Search Parameters:
Keywords = Speech-to-Text (STT)

17 pages, 2127 KB  
Article
Leveraging Large Language Models for Real-Time UAV Control
by Kheireddine Choutri, Samiha Fadloun, Ayoub Khettabi, Mohand Lagha, Souham Meshoul and Raouf Fareh
Electronics 2025, 14(21), 4312; https://doi.org/10.3390/electronics14214312 - 2 Nov 2025
Cited by 3 | Viewed by 3716
Abstract
As drones become increasingly integrated into civilian and industrial domains, the demand for natural and accessible control interfaces continues to grow. Conventional manual controllers require technical expertise and impose cognitive overhead, limiting their usability in dynamic and time-critical scenarios. To address these limitations, this paper presents a multilingual voice-driven control framework for quadrotor drones, enabling real-time operation in both English and Arabic. The proposed architecture combines offline Speech-to-Text (STT) processing with large language models (LLMs) to interpret spoken commands and translate them into executable control code. Specifically, Vosk is employed for bilingual STT, while Google Gemini provides semantic disambiguation, contextual inference, and code generation. The system is designed for continuous, low-latency operation within an edge–cloud hybrid configuration, offering an intuitive and robust human–drone interface. While speech recognition and safety validation are processed entirely offline, high-level reasoning and code generation currently rely on cloud-based LLM inference. Experimental evaluation demonstrates an average speech recognition accuracy of 95% and end-to-end command execution latency between 300 and 500 ms, validating the feasibility of reliable, multilingual, voice-based UAV control. This research advances multimodal human–robot interaction by showcasing the integration of offline speech recognition and LLMs for adaptive, safe, and scalable aerial autonomy. Full article

6 pages, 188 KB  
Proceeding Paper
TTS and STT in Service of Education
by Zakaria El Fakir, Oussama Kaich, El Habib Benlahmar, Sanaa El Filali and Omar Zahour
Eng. Proc. 2025, 112(1), 4; https://doi.org/10.3390/engproc2025112004 - 14 Oct 2025
Viewed by 2558
Abstract
This article explores how Text-to-Speech (TTS) and Speech-to-Text (STT) technologies are being harnessed in education to enhance accessibility, language development, and overall learner engagement. Drawing upon theoretical frameworks in linguistics and educational psychology, we highlight the benefits TTS and STT can offer to diverse student populations, including students with disabilities, language learners, and those seeking personalized or self-paced instruction. We discuss methods for integrating TTS and STT into the classroom (hardware, software, and practical considerations) and offer case studies of effective implementations in areas such as literacy support, foreign language acquisition, and assessment. We then address the pedagogical benefits these tools provide—such as differentiated instruction, immediate feedback, and a heightened sense of learner autonomy—along with limitations and challenges that educators may encounter. In conclusion, we suggest future directions for research and practice, underscoring the importance of teacher training, ethical considerations, and ever-evolving advancements in natural language processing. Full article
29 pages, 1708 KB  
Article
Speech Recognition and Synthesis Models and Platforms for the Kazakh Language
by Aidana Karibayeva, Vladislav Karyukin, Balzhan Abduali and Dina Amirova
Information 2025, 16(10), 879; https://doi.org/10.3390/info16100879 - 10 Oct 2025
Cited by 1 | Viewed by 5889
Abstract
With the rapid development of artificial intelligence and machine learning technologies, automatic speech recognition (ASR) and text-to-speech (TTS) have become key components of the digital transformation of society. The Kazakh language, as a representative of the Turkic language family, remains a low-resource language with limited audio corpora, language models, and high-quality speech synthesis systems. This study provides a comprehensive analysis of existing speech recognition and synthesis models, emphasizing their applicability and adaptation to the Kazakh language. Special attention is given to linguistic and technical barriers, including the agglutinative structure, rich vowel system, and phonemic variability. Both open-source and commercial solutions were evaluated, including Whisper, GPT-4 Transcribe, ElevenLabs, OpenAI TTS, Voiser, KazakhTTS2, and TurkicTTS. Speech recognition systems were assessed using BLEU, WER, TER, chrF, and COMET, while speech synthesis was evaluated with MCD, PESQ, STOI, and DNSMOS, thus covering both lexical–semantic and acoustic–perceptual characteristics. The results demonstrate that, for speech-to-text (STT), the strongest performance was achieved by Soyle on domain-specific data (BLEU 74.93, WER 18.61), while Voiser showed balanced accuracy (WER 40.65–37.11, chrF 80.88–84.51) and GPT-4 Transcribe achieved robust semantic preservation (COMET up to 1.02). In contrast, Whisper performed weakest (WER 77.10, BLEU 13.22), requiring further adaptation for Kazakh. For text-to-speech (TTS), KazakhTTS2 delivered the most natural perceptual quality (DNSMOS 8.79–8.96), while OpenAI TTS achieved the best spectral accuracy (MCD 123.44–117.11, PESQ 1.14). TurkicTTS offered reliable intelligibility (STOI 0.15, PESQ 1.16), and ElevenLabs produced natural but less spectrally accurate speech. Full article
(This article belongs to the Section Artificial Intelligence)
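The abstract above compares STT systems by word error rate (WER), among other metrics. As a rough illustration of what that number measures, here is a minimal sketch that computes WER as word-level edit distance divided by the reference length. The function name and the whitespace tokenization are assumptions made for this example; the paper's own evaluation tooling is not described in the abstract, and it reports WER on a 0-100 scale, whereas this sketch returns a fraction.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words.

    Tokenization is plain whitespace splitting, an assumption for this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    # One substitution (sat -> sit) and one deletion (the) over 6 words.
    print(f"WER = {word_error_rate(ref, hyp):.3f}")
```

Multiplying the result by 100 gives a percentage comparable in scale to the figures quoted in the abstract.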

17 pages, 1022 KB  
Article
Accuracy of Speech-to-Text Transcription in a Digital Cognitive Assessment for Older Adults
by Ariel M. Gordon and Peter E. Wais
Brain Sci. 2025, 15(10), 1090; https://doi.org/10.3390/brainsci15101090 - 9 Oct 2025
Viewed by 1449
Abstract
Background/Objectives: Neuropsychological assessments are valuable tools for evaluating the cognitive performance of older adults. Limitations associated with these in-person paper-and-pencil tests have inspired efforts to develop digital assessments, which would expand access to cognitive screening. Digital tests, however, often lack validity relative to gold-standard paper-and-pencil versions that have been robustly validated. Speech-to-text (STT) technology has the potential to improve the validity of digital tests through its ability to capture verbal responses, yet the effect of its performance on standardized scores used for cognitive characterization is unknown. Methods: The present study evaluated the accuracy of Apple’s STT engine relative to ground-truth transcriptions (RQ1), as well as the effect of the engine’s transcription errors on resulting standardized scores (RQ2). Our study analyzed data from 223 older adults who completed a digital assessment on an iPad that used STT to transcribe and score task responses. These automated transcriptions were then compared against ground-truth transcriptions that were human-corrected via external recordings. Results: Results showed differences between STT and ground-truth transcriptions (RQ1). Nevertheless, these differences were not large enough to practically affect standardized measures of cognitive performance (RQ2). Conclusions: Our results demonstrate the practical utility of Apple’s STT engine for digital neuropsychological assessment and cognitive characterization. These findings support the possibility that speech-to-text, with its ability to capture and process verbal responses, will be a viable tool for increasing the validity of digital neuropsychological assessments. Full article
(This article belongs to the Special Issue Perspectives of Artificial Intelligence (AI) in Aging Neuroscience)

32 pages, 3609 KB  
Article
BPMN-Based Design of Multi-Agent Systems: Personalized Language Learning Workflow Automation with RAG-Enhanced Knowledge Access
by Hedi Tebourbi, Sana Nouzri, Yazan Mualla, Meryem El Fatimi, Amro Najjar, Abdeljalil Abbas-Turki and Mahjoub Dridi
Information 2025, 16(9), 809; https://doi.org/10.3390/info16090809 - 17 Sep 2025
Cited by 1 | Viewed by 4568
Abstract
The intersection of Artificial Intelligence (AI) and education is revolutionizing learning and teaching in this digital era, with Generative AI and large language models (LLMs) providing even greater possibilities for the future. The digital transformation of language education demands innovative approaches that combine pedagogical rigor with explainable AI (XAI) principles, particularly for low-resource languages. This paper presents a novel methodology that integrates Business Process Model and Notation (BPMN) with Multi-Agent Systems (MAS) to create transparent, workflow-driven language tutors. Our approach uniquely embeds XAI through three mechanisms: (1) BPMN’s visual formalism that makes agent decision-making auditable, (2) Retrieval-Augmented Generation (RAG) with verifiable knowledge provenance from textbooks of the National Institute of Languages of Luxembourg, and (3) human-in-the-loop validation of both content and pedagogical sequencing. To ensure realism in learner interaction, we integrate speech-to-text and text-to-speech technologies, creating an immersive, human-like learning environment. The system simulates intelligent tutoring through agents’ collaboration and dynamic adaptation to learner progress. We demonstrate this framework through a Luxembourgish language learning platform where specialized agents (Conversational, Reading, Listening, QA, and Grammar) operate within BPMN-modeled workflows. The system achieves high response faithfulness (0.82) and relevance (0.85) according to RAGA metrics, while speech integration using Whisper STT and Coqui TTS enables immersive practice. Evaluation with learners showed 85.8% satisfaction with contextual responses and 71.4% engagement rates, confirming the effectiveness of our process-driven approach. This work advances AI-powered language education by showing how formal process modeling can create pedagogically coherent and explainable tutoring systems. 
The architecture’s modularity supports extension to other low-resource languages while maintaining the transparency critical for educational trust. Future work will expand curriculum coverage and develop teacher-facing dashboards to further improve explainability. Full article
(This article belongs to the Section Information Applications)

30 pages, 21387 KB  
Article
An Intelligent Docent System with a Small Large Language Model (sLLM) Based on Retrieval-Augmented Generation (RAG)
by Taemoon Jung and Inwhee Joe
Appl. Sci. 2025, 15(17), 9398; https://doi.org/10.3390/app15179398 - 27 Aug 2025
Cited by 6 | Viewed by 3439
Abstract
This study designed and empirically evaluated a method to enhance information accessibility for museum and art gallery visitors using a small Large Language Model (sLLM) based on the Retrieval-Augmented Generation (RAG) framework. Over 199,000 exhibition descriptions were collected and refined, and a question-answering dataset consisting of 102,000 pairs reflecting user personas was constructed to develop DocentGemma, a domain-optimized language model. This model was fine-tuned through Low-Rank Adaptation (LoRA) based on Google’s Gemma2-9B and integrated with FAISS and OpenSearch-based document retrieval systems within the LangChain framework. Performance evaluation was conducted using a dedicated Q&A benchmark for the docent domain, comparing the model against five commercial and open-source LLMs (including GPT-3.5 Turbo, LLaMA3.3-70B, and Gemma2-9B). DocentGemma achieved an accuracy of 85.55% and a perplexity of 3.78, demonstrating competitive performance in language generation and response accuracy within the domain-specific context. To enhance retrieval relevance, a Spatio-Contextual Retriever (SC-Retriever) was introduced, which combines semantic similarity and spatial proximity based on the user’s query and location. An ablation study confirmed that integrating both modalities improved retrieval quality, with the SC-Retriever achieving a recall@1 of 53.45% and a Mean Reciprocal Rank (MRR) of 68.12, representing a 17.5–20% gain in search accuracy compared to baseline models such as GTE and SpatialNN. System performance was further validated through field deployment at three major exhibition venues in Seoul (the Seoul History Museum, the Hwan-ki Museum, and the Hanseong Baekje Museum). A user test involving 110 participants indicated high response credibility and an average satisfaction score of 4.24. To ensure accessibility, the system supports various output formats, including multilingual speech and subtitles.
This work illustrates a practical application of integrating LLM-based conversational capabilities into traditional docent services and suggests potential for further development toward location-aware interactive systems and AI-driven cultural content services. Full article
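The retrieval results in this abstract are reported as recall@1 and Mean Reciprocal Rank (MRR). The sketch below shows how these two metrics are commonly computed for queries that each have a single relevant document. The function names, the one-relevant-document assumption, and the toy data are all illustrative; the abstract does not describe the authors' evaluation code. Both functions return values in the range 0 to 1, while the abstract appears to report them on a 0-100 scale.

```python
from typing import Hashable, Sequence


def recall_at_k(
    ranked_lists: Sequence[Sequence[Hashable]],
    relevant_ids: Sequence[Hashable],
    k: int = 1,
) -> float:
    """Fraction of queries whose relevant document appears in the top k."""
    hits = sum(
        1 for ranked, rel in zip(ranked_lists, relevant_ids) if rel in ranked[:k]
    )
    return hits / len(relevant_ids)


def mean_reciprocal_rank(
    ranked_lists: Sequence[Sequence[Hashable]],
    relevant_ids: Sequence[Hashable],
) -> float:
    """Average of 1/rank of the relevant document (0 if it was not retrieved)."""
    total = 0.0
    for ranked, rel in zip(ranked_lists, relevant_ids):
        if rel in ranked:
            total += 1.0 / (list(ranked).index(rel) + 1)
    return total / len(relevant_ids)


if __name__ == "__main__":
    # Hypothetical rankings for three queries; "a" is the relevant document
    # for each query and appears at rank 1, 2, and 3 respectively.
    rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
    relevant = ["a", "a", "a"]
    print(f"recall@1 = {recall_at_k(rankings, relevant, k=1):.3f}")
    print(f"MRR      = {mean_reciprocal_rank(rankings, relevant):.3f}")
```

Recall@1 only rewards a correct top result, whereas MRR also gives partial credit when the relevant document is ranked second or third, which is why the two numbers in the abstract differ.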

10 pages, 724 KB  
Article
Real-Time Speech-to-Text on Edge: A Prototype System for Ultra-Low Latency Communication with AI-Powered NLP
by Stefano Di Leo, Luca De Cicco and Saverio Mascolo
Information 2025, 16(8), 685; https://doi.org/10.3390/info16080685 - 11 Aug 2025
Cited by 2 | Viewed by 11138
Abstract
This paper presents a real-time speech-to-text (STT) system designed for edge computing environments requiring ultra-low latency and local processing. Differently from cloud-based STT services, the proposed solution runs entirely on a local infrastructure which allows the enforcement of user privacy and provides high performance in bandwidth-limited or offline scenarios. The designed system is based on a browser-native audio capture through WebRTC, real-time streaming with WebSocket, and offline automatic speech recognition (ASR) utilizing the Vosk engine. A natural language processing (NLP) component, implemented as a microservice, improves transcription results for spelling accuracy and clarity. Our prototype reaches sub-second end-to-end latency and strong transcription capabilities under realistic conditions. Furthermore, the modular architecture allows extensibility, integration of advanced AI models, and domain-specific adaptations. Full article
(This article belongs to the Section Information Applications)

6 pages, 175 KB  
Proceeding Paper
Comparative Analysis of Energy Consumption and Carbon Footprint in Automatic Speech Recognition Systems: A Case Study Comparing Whisper and Google Speech-to-Text
by Jalal El Bahri, Mohamed Kouissi and Mohammed Achkari Begdouri
Comput. Sci. Math. Forum 2025, 10(1), 6; https://doi.org/10.3390/cmsf2025010006 - 16 Jun 2025
Viewed by 4093
Abstract
This study investigates the energy consumption and carbon footprint of two prominent automatic speech recognition (ASR) systems: OpenAI’s Whisper and Google’s Speech-to-Text API. We evaluate both local and cloud-based speech recognition approaches using a public Kaggle dataset of 20,000 short audio clips in Urdu, utilizing CodeCarbon, PyJoule, and PowerAPI for comprehensive energy profiling. As a result of our analysis, we expose some substantial differences between the two systems in terms of energy efficiency and carbon emissions, with the cloud-based solution showing substantially lower environmental impact despite comparable accuracy. We discuss the implications of these findings for sustainable AI deployment and minimizing the ecological footprint of speech recognition technologies. Full article
16 pages, 4485 KB  
Article
Auto-Scoring Feature Based on Sentence Transformer Similarity Check with Korean Sentences Spoken by Foreigners
by Aria Bisma Wahyutama and Mintae Hwang
Appl. Sci. 2023, 13(1), 373; https://doi.org/10.3390/app13010373 - 28 Dec 2022
Cited by 5 | Viewed by 4198
Abstract
This paper contains the development of a training service for foreigners to help them increase their ability to speak Korean. The service developed in this paper is implemented in the form of a mobile application that shows specific Korean sentences to the user for them to record themselves speaking the sentence. The objective is to generate the score automatically based on how similar the recorded voice with the actual sentence using Speech-To-Text (STT) engines and Sentence Transformers. The application is developed by selecting the four most commonly known STT engines with similar features, which are Google API, Microsoft Azure, Naver Clova, and IBM Watson, which are put into a Rest API along with the Sentence Transformer. The mobile application will record the user’s voice and send it to the Rest API. The STT engines will transcribe the file into a text and then feed it into a Sentence Transformer to generate the score based on their similarity. After measuring the response time and consistency as the performance evaluation by simulating a scenario using an Android emulator, Microsoft Azure with 1.13 s is found to be the fastest STT engine and Naver Clova is found to be the least consistent engine with nine different transcribe results. Full article
(This article belongs to the Special Issue Future Information & Communication Engineering 2022)
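The abstract above describes scoring a learner's utterance by transcribing it and comparing the transcript with the target sentence through a Sentence Transformer. One common way to compare two sentence embeddings is cosine similarity, sketched below. This is an assumption for illustration: the abstract does not state which similarity measure or score scale the authors use, and the embedding vectors here are made-up stand-ins for real model output.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vector matches nothing
    return dot / (norm_u * norm_v)


def similarity_to_score(similarity: float, max_score: int = 100) -> int:
    """Map a similarity to an integer score; the 0-100 scale is hypothetical."""
    clamped = max(0.0, min(1.0, similarity))
    return round(clamped * max_score)


if __name__ == "__main__":
    # Hypothetical embeddings of the target sentence and the STT transcript.
    target_embedding = [0.12, 0.80, 0.35, 0.41]
    transcript_embedding = [0.10, 0.75, 0.40, 0.38]
    sim = cosine_similarity(target_embedding, transcript_embedding)
    print(f"similarity = {sim:.3f}, score = {similarity_to_score(sim)}")
```

In a real pipeline the two vectors would come from encoding the target sentence and the STT output with the same embedding model, so that a transcription error lowers the similarity and therefore the score.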

22 pages, 1693 KB  
Review
Arabic Automatic Speech Recognition: A Systematic Literature Review
by Amira Dhouib, Achraf Othman, Oussama El Ghoul, Mohamed Koutheair Khribi and Aisha Al Sinani
Appl. Sci. 2022, 12(17), 8898; https://doi.org/10.3390/app12178898 - 5 Sep 2022
Cited by 43 | Viewed by 15635
Abstract
Automatic Speech Recognition (ASR), also known as Speech-To-Text (STT) or computer speech recognition, has been an active field of research recently. This study aims to chart this field by performing a Systematic Literature Review (SLR) to give insight into the ASR studies proposed, especially for the Arabic language. The purpose is to highlight the trends of research about Arabic ASR and guide researchers with the most significant studies published over ten years from 2011 to 2021. This SLR attempts to tackle seven specific research questions related to the toolkits used for developing and evaluating Arabic ASR, the supported type of the Arabic language, the used feature extraction/classification techniques, the type of speech recognition, the performance of Arabic ASR, the existing gaps facing researchers, along with some future research. Across five databases, 38 studies met our defined inclusion criteria. Our results showed different open-source toolkits to support Arabic speech recognition. The most prominent ones were KALDI, HTK, then CMU Sphinx toolkits. A total of 89.47% of the retained studies cover modern standard Arabic, whereas 26.32% of them were dedicated to different dialects of Arabic. MFCC and HMM were presented as the most used feature extraction and classification techniques, respectively: 63% of the papers were based on MFCC and 21% were based on HMM. The review also shows that the performance of Arabic ASR systems depends mainly on different criteria related to the availability of resources, the techniques used for acoustic modeling, and the used datasets. Full article
(This article belongs to the Special Issue Automatic Speech Recognition)

13 pages, 5553 KB  
Article
Implementation of Detection System for Drowsy Driving Prevention Using Image Recognition and IoT
by Seok-Woo Jang and Byeongtae Ahn
Sustainability 2020, 12(7), 3037; https://doi.org/10.3390/su12073037 - 10 Apr 2020
Cited by 38 | Viewed by 11466
Abstract
In recent years, the casualties of traffic accidents caused by driving cars have been gradually increasing. In particular, there are more serious injuries and deaths than minor injuries, and the damage due to major accidents is increasing. In particular, heavy cargo trucks and high-speed bus accidents that occur during driving in the middle of the night have emerged as serious social problems. Therefore, in this study, a drowsiness prevention system was developed to prevent large-scale disasters caused by traffic accidents. In this study, machine learning was applied to predict drowsiness and improve drowsiness prediction using facial recognition technology and eye-blink recognition technology. Additionally, a CO2 sensor chip was used to detect additional drowsiness. Speech recognition technology can also be used to apply Speech to Text (STT), allowing a driver to request their desired music or make a call to avoid drowsiness while driving. Full article
(This article belongs to the Special Issue Big Data for Sustainable Anticipatory Computing)

19 pages, 4144 KB  
Article
An Audification and Visualization System (AVS) of an Autonomous Vehicle for Blind and Deaf People Based on Deep Learning
by Surak Son, YiNa Jeong and Byungkwan Lee
Sensors 2019, 19(22), 5035; https://doi.org/10.3390/s19225035 - 18 Nov 2019
Cited by 6 | Viewed by 5385
Abstract
When blind and deaf people are passengers in fully autonomous vehicles, an intuitive and accurate visualization screen should be provided for the deaf, and an audification system with speech-to-text (STT) and text-to-speech (TTS) functions should be provided for the blind. However, these systems cannot know the fault self-diagnosis information and the instrument cluster information that indicates the current state of the vehicle when driving. This paper proposes an audification and visualization system (AVS) of an autonomous vehicle for blind and deaf people based on deep learning to solve this problem. The AVS consists of three modules. The data collection and management module (DCMM) stores and manages the data collected from the vehicle. The audification conversion module (ACM) has a speech-to-text submodule (STS) that recognizes a user’s speech and converts it to text data, and a text-to-wave submodule (TWS) that converts text data to voice. The data visualization module (DVM) visualizes the collected sensor data, fault self-diagnosis data, etc., and places the visualized data according to the size of the vehicle’s display. The experiment shows that the time taken to adjust visualization graphic components in on-board diagnostics (OBD) was approximately 2.5 times faster than the time taken in a cloud server. In addition, the overall computational time of the AVS system was approximately 2 ms faster than the existing instrument cluster. Therefore, because the AVS proposed in this paper can enable blind and deaf people to select only what they want to hear and see, it reduces the overload of transmission and greatly increases the safety of the vehicle. If the AVS is introduced in a real vehicle, it can prevent accidents for disabled and other passengers in advance. Full article
(This article belongs to the Special Issue Smart Sensors and Devices in Artificial Intelligence)
