This is an early access version; the complete PDF, HTML, and XML versions will be available soon.
Article

Under-Represented Speech Dataset from Open Data: Case Study on the Romanian Language

Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy, 050711 Bucharest, Romania
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(19), 9043; https://doi.org/10.3390/app14199043
Submission received: 21 August 2024 / Revised: 20 September 2024 / Accepted: 2 October 2024 / Published: 7 October 2024
(This article belongs to the Special Issue Natural Language Processing in the Era of Artificial Intelligence)

Abstract

This paper introduces USPDATRO, a Romanian-language speech dataset constructed from open data and focused on under-represented voice types (children, young and old people, and female voices). The paper covers the methodology behind the dataset's construction, details of the dataset itself, and an evaluation of existing Romanian Automatic Speech Recognition (ASR) systems with different architectures. The results indicate that more under-represented speech content is needed when training ASR systems. Our approach can be extended to other low-resourced languages, as long as open data are available.
Keywords: speech dataset; under-represented voices; speech recognition; Romanian language

Share and Cite

MDPI and ACS Style

Păiș, V.; Barbu Mititelu, V.; Irimia, E.; Ion, R.; Tufiș, D. Under-Represented Speech Dataset from Open Data: Case Study on the Romanian Language. Appl. Sci. 2024, 14, 9043. https://doi.org/10.3390/app14199043

AMA Style

Păiș V, Barbu Mititelu V, Irimia E, Ion R, Tufiș D. Under-Represented Speech Dataset from Open Data: Case Study on the Romanian Language. Applied Sciences. 2024; 14(19):9043. https://doi.org/10.3390/app14199043

Chicago/Turabian Style

Păiș, Vasile, Verginica Barbu Mititelu, Elena Irimia, Radu Ion, and Dan Tufiș. 2024. "Under-Represented Speech Dataset from Open Data: Case Study on the Romanian Language" Applied Sciences 14, no. 19: 9043. https://doi.org/10.3390/app14199043
