Real-Time Speech-to-Text on Edge: A Prototype System for Ultra-Low Latency Communication with AI-Powered NLP
Abstract
1. Introduction
2. Background and Related Work
3. Materials and Methods
3.1. System Architecture Overview
3.2. Component Description
3.2.1. Client-Side Application
3.2.2. Backend Server (Node.js)
3.2.3. Speech Recognition Microservice (Vosk)
3.2.4. NLP Correction Microservice (Flask)
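No code listing accompanies this component in the text; purely as an illustration, a correction microservice of this kind can be exposed as a small HTTP endpoint. In the sketch below the /correct route, the JSON field names, and the use of the open-source autocorrect package are assumptions made for the example, not the authors' implementation.

```python
# Hypothetical sketch of a Flask orthographic-correction microservice.
# The route name, JSON fields, and the autocorrect dependency are assumptions.
from autocorrect import Speller          # pip install autocorrect
from flask import Flask, jsonify, request

app = Flask(__name__)
spell = Speller(lang="en")               # dictionary-based English spell corrector

@app.route("/correct", methods=["POST"])
def correct():
    text = request.get_json(force=True).get("text", "")
    return jsonify({"original": text, "corrected": spell(text)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)   # Flask development server, not for production
```

Keeping the corrector behind its own endpoint lets the backend invoke it only on finalized text, which is consistent with the sub-30 ms correction latency reported in Section 5.2.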
3.2.5. Communication Protocols
4. Implementation Details
4.1. Audio Workflow Overview
4.2. Buffer Management and Token Differentiation
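The partial/final token distinction maps directly onto the streaming interface of an offline recognizer such as Vosk, which returns provisional partial hypotheses while audio is being consumed and a stable result once an utterance boundary is detected. The sketch below illustrates this under that assumption; the model path, chunk size, and print-based handling are placeholders rather than the prototype's actual buffer logic.

```python
# Hypothetical sketch: differentiating partial vs. final tokens with Vosk.
# Model path, chunk size, and the handling of results are illustrative only.
import json
import wave

from vosk import KaldiRecognizer, Model   # pip install vosk

model = Model("model/vosk-model-small-en-us-0.15")   # assumed model directory
rec = KaldiRecognizer(model, 16000)                  # expects 16 kHz mono PCM

wf = wave.open("chunk.wav", "rb")
while True:
    data = wf.readframes(4000)
    if not data:
        break
    if rec.AcceptWaveform(data):
        # Final token: stable hypothesis, eligible for orthographic correction.
        print("FINAL  :", json.loads(rec.Result()).get("text", ""))
    else:
        # Partial token: provisional hypothesis, displayed but not yet corrected.
        print("PARTIAL:", json.loads(rec.PartialResult()).get("partial", ""))

# Flush whatever remains buffered in the recognizer at end of stream.
print("FINAL  :", json.loads(rec.FinalResult()).get("text", ""))
```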
4.3. Orthographic Correction
4.4. Communication and Protocol Design
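Only the event-based shape of the exchange is sketched here: the client pushes binary audio chunks and receives transcription events back over a persistent connection. The event names, payload formats, and the use of a Python Socket.IO client are assumptions for illustration; the prototype itself pairs a client-side application with a Node.js backend.

```python
# Hypothetical sketch of the event-based exchange over Socket.IO.
# Event names ("audio-chunk", "partial", "final") and the URL are assumptions.
import socketio   # pip install "python-socketio[client]"

sio = socketio.Client()

@sio.on("partial")
def on_partial(data):
    print("partial transcript:", data)

@sio.on("final")
def on_final(data):
    print("final (corrected) transcript:", data)

sio.connect("http://localhost:3000")      # assumed backend address

with open("chunk.webm", "rb") as f:
    sio.emit("audio-chunk", f.read())     # Socket.IO supports binary payloads

sio.wait()                                # keep listening for transcription events
```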
5. Validation and Performance Evaluation
5.1. Experimental Setup
5.2. Latency Analysis
- Audio Conversion (WebM to WAV): 83–107 ms;
- STT Inference (Vosk): 260–969 ms;
- Orthographic Correction: consistently under 30 ms.
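Per-stage figures of this kind can be obtained with simple wall-clock instrumentation around each processing step. As an indication of how such measurements can be taken, the sketch below times the WebM-to-WAV conversion stage using the ffmpeg CLI; the file names and the 16 kHz mono target format are assumptions for the example.

```python
# Hypothetical sketch: timing the WebM -> WAV conversion stage with ffmpeg.
# File names and the 16 kHz mono target format are illustrative assumptions.
import subprocess
import time

def convert_webm_to_wav(src: str, dst: str) -> float:
    """Convert a WebM chunk to 16 kHz mono WAV; return the elapsed time in ms."""
    start = time.perf_counter()
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst],
        check=True,
        capture_output=True,   # keep ffmpeg's log output off the console
    )
    return (time.perf_counter() - start) * 1000.0

if __name__ == "__main__":
    print(f"conversion latency: {convert_webm_to_wav('chunk.webm', 'chunk.wav'):.0f} ms")
```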
5.3. Accuracy Metrics and Scenarios
- Overall WER: below 5%;
- Overall CER: between 5% and 10%.
- Scenario A (clean, native English): WER 2.4%, CER 6.3%;
- Scenario B (Italian-accented English, quiet background): WER 4.6%, CER 9.2%;
- Scenario C (noisy, spontaneous speech): WER 4.9%, CER 9.8%.
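WER and CER follow the standard edit-distance definitions: the minimum number of substitutions, deletions and insertions needed to turn the hypothesis into the reference, divided by the reference length in words (WER) or characters (CER). A self-contained reference implementation of these definitions (not the authors' evaluation script) is sketched below.

```python
# Edit-distance-based WER/CER, following the standard definitions
# (this is a reference sketch, not the authors' evaluation script).
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, computed with a single DP row."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,        # deletion
                dp[j - 1] + 1,    # insertion
                prev + (r != h),  # substitution (cost 0 if tokens match)
            )
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

if __name__ == "__main__":
    ref = "real time speech to text on the edge"
    hyp = "real time speech to text on edge"         # one deleted word
    print(f"WER = {wer(ref, hyp):.3f}, CER = {cer(ref, hyp):.3f}")
```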
6. Discussion
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
Per-chunk processing latency; orthographic correction is reported only for the final chunk.

| Chunk | Conversion (ms) | STT Delay (ms) | Correction (ms) | Total (ms) |
|---|---|---|---|---|
| 1 | 107 | 260 | – | 367 |
| 2 | 83 | 463 | – | 546 |
| 3 | 93 | 682 | – | 775 |
| 4 | 93 | 890 | 20 | 1003 |
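The Total column is the per-chunk sum of the preceding stages, with the correction stage reported only for the final chunk. A quick consistency check over the values above (the "–" entries are treated as 0 ms):

```python
# Sanity check: Total = Conversion + STT + Correction for each row of the table above
# ("–" entries are treated as 0 ms).
rows = [
    (1, 107, 260, 0, 367),
    (2, 83, 463, 0, 546),
    (3, 93, 682, 0, 775),
    (4, 93, 890, 20, 1003),
]
for chunk, conv, stt, corr, total in rows:
    assert conv + stt + corr == total, f"chunk {chunk} does not add up"
print("all totals consistent")
```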