Under-Represented Speech Dataset from Open Data: Case Study on the Romanian Language

Păiș, Vasile; Barbu Mititelu, Verginica; Irimia, Elena; Ion, Radu; Tufiș, Dan

doi:10.3390/app14199043


Open Access Article


by Vasile Păiș *, Verginica Barbu Mititelu, Elena Irimia, Radu Ion and Dan Tufiș

Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy, 050711 Bucharest, Romania

* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(19), 9043; https://doi.org/10.3390/app14199043

Submission received: 21 August 2024 / Revised: 20 September 2024 / Accepted: 2 October 2024 / Published: 7 October 2024

(This article belongs to the Special Issue Natural Language Processing in the Era of Artificial Intelligence)


Abstract

This paper introduces USPDATRO, a Romanian speech dataset constructed from open data and focused on under-represented voice types (children, young and old people, and female voices). The paper covers the methodology behind the dataset construction, specific details of the dataset, and an evaluation of existing Romanian Automatic Speech Recognition (ASR) systems with different architectures. The results indicate that more under-represented speech content is needed in the training of ASR systems. The approach can be extended to other low-resourced languages, as long as open data are available.

Keywords: speech dataset; under-represented voices; speech recognition; Romanian language

Share and Cite

MDPI and ACS Style

Păiș, V.; Barbu Mititelu, V.; Irimia, E.; Ion, R.; Tufiș, D. Under-Represented Speech Dataset from Open Data: Case Study on the Romanian Language. Appl. Sci. 2024, 14, 9043. https://doi.org/10.3390/app14199043

AMA Style

Păiș V, Barbu Mititelu V, Irimia E, Ion R, Tufiș D. Under-Represented Speech Dataset from Open Data: Case Study on the Romanian Language. Applied Sciences. 2024; 14(19):9043. https://doi.org/10.3390/app14199043

Chicago/Turabian Style

Păiș, Vasile, Verginica Barbu Mititelu, Elena Irimia, Radu Ion, and Dan Tufiș. 2024. "Under-Represented Speech Dataset from Open Data: Case Study on the Romanian Language" Applied Sciences 14, no. 19: 9043. https://doi.org/10.3390/app14199043

