1. Introduction
Speech and Language Technology (SLT) solutions for children are useful for many applications, e.g., conversational interfaces, technologies for the diagnosis and treatment of a variety of developmental disorders, and education and learning [1,2]. However, research and development on children’s speech-driven SLT applications lag behind those for adults’ speech. There are several reasons for this. Children’s speech is known to differ from adults’ speech in many aspects, including acoustic, prosodic, lexical, morphosyntactic, and pragmatic aspects, caused by physiological differences (e.g., shorter vocal tract lengths), cognitive/developmental differences (e.g., different stages of language acquisition), and behavioral differences [3,4,5]. For instance, children’s speech exhibits increased magnitude and variability of temporal and spectral parameters in vowels, such as duration, fundamental frequency (F0), and formants (F1–F3), compared to adults’ speech. Around 15 years of age, children’s speech starts to resemble that of adults [6], implying that children’s speech changes over the years and becomes increasingly ‘adult-like’. Moreover, the vocal tract of a child is not just a smaller version of an adult vocal tract [7]. Since acoustic features used for speech processing, such as the Mel-frequency Cepstral Coefficients (MFCCs) [8], are based on a model of the adult vocal tract, they might not capture the underlying child vocal tract well. Another major reason why SLT applications and research on children’s speech are less well developed than those for adults’ speech is the limited availability of children’s speech datasets. This scarcity is partly due to stricter privacy standards associated with collecting and sharing children’s data [9,10]. The shortage of annotated children’s speech data makes it a low-resource problem. These issues make the research and development of Children’s Speech Recognition (CSR) systems challenging.
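To make the feature assumption above concrete, the following minimal sketch (in Python, using librosa; the file name and parameter values are illustrative, not those of any system in this paper) extracts MFCCs with a standard mel filterbank, which is fixed and does not adapt to a speaker’s vocal tract length:

```python
import librosa

# Load an utterance at 16 kHz ("child_utterance.wav" is a placeholder path).
y, sr = librosa.load("child_utterance.wav", sr=16000)

# Standard 13-dimensional MFCCs over a 40-filter mel bank: the mel scale and
# filter placement are the same for every speaker, child or adult.
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=40)
print(mfccs.shape)  # (13, num_frames)
```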
In this study, our main aim is to improve the performance of Automatic Speech Recognition (ASR) systems for children’s speech in the absence of any children’s training data (speech and text), a situation occurring for many languages in the world and leading to large performance gaps between adults’ and children’s speech recognition. For instance, the authors of [11] showed that an End-to-End (E2E) transformer-based ASR system (without a Language Model (LM)) trained on the English adults’ speech of Librispeech achieved a 2.89% Word Error Rate (WER) for adults, but the WER increased to 38.8% on the MyST corpus and reached 87.2% on the OGI Kids corpus. These high error rates depend on several factors, including age and speech type, and are likely to be similar for other languages. We aim to improve children’s speech recognition performance by tackling the two biggest problems outlined above: the mismatch between adults’ and children’s speech, i.e., the variability in children’s speech, on the one hand, and data scarcity on the other.
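For reference, WER is the word-level Levenshtein (edit) distance between hypothesis and reference, normalized by the reference length; a minimal sketch of the metric used throughout this paper:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / #reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sit here"))  # 2 errors / 3 words = 0.67
```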
The mismatch between adults’ and children’s speech, i.e., the variability in children’s speech, is often addressed through improved acoustic features that capture this variability (see also Section 2.2). For example, Vocal Tract Length Normalization (VTLN) [12] has been widely used to reduce the acoustic feature mismatch between adults’ and children’s speech due to vocal tract length variations [13,14]. Children typically have a shorter vocal tract, resulting in higher-frequency sounds; the VTLN technique normalizes speech features based on estimated warping factors, which account for the variations in vocal tract length. To tackle the data scarcity issue, adults’ speech training data are typically modified acoustically to resemble children’s speech (see also Section 2.2), for instance, through pitch modification [15], spectral modification [16], voice conversion [17], and speed perturbations (SP) [18], and the modified speech is used as additional training material. The chosen scenario, in which no child data are available, also means that no LM will be used. Note that although LM integration could enhance performance, especially for read speech by adult speakers [19], it might not effectively model the unique patterns found in children’s speech, as the grammar and structure of children’s speech differ from those of adults. This was also observed in [11], where an E2E transformer-based ASR trained on the adults’ speech from Librispeech without an LM outperformed an E2E ASR with an LM when tested on children’s speech from the MyST corpus and the OGI Kids corpus.
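As a concrete illustration of the VTLN technique discussed above, the sketch below implements a simplified piecewise-linear frequency warp of the kind used in HTK/Kaldi-style front-ends and applies it to mel filter center frequencies; the knee position (7/8 of the usable band) and the example warping factor are illustrative choices, not the settings of this study:

```python
import numpy as np

def vtln_warp(freq_hz, alpha, f_max=8000.0):
    """Simplified piecewise-linear VTLN warp: scale frequencies by the
    warping factor alpha up to a knee frequency, then interpolate linearly
    so that f_max maps onto itself."""
    freq_hz = np.asarray(freq_hz, dtype=float)
    # Keep the knee low enough that alpha * knee stays below f_max.
    knee = 0.875 * f_max / max(1.0, alpha)
    warped = alpha * freq_hz
    above = freq_hz > knee
    # Linear segment joining (knee, alpha * knee) to (f_max, f_max).
    slope = (f_max - alpha * knee) / (f_max - knee)
    warped[above] = alpha * knee + slope * (freq_hz[above] - knee)
    return warped

# Warping the center frequencies of a 40-filter bank; per-speaker factors
# are typically searched over a grid such as 0.88-1.12.
centers = np.linspace(100.0, 7600.0, 40)   # hypothetical filter centers (Hz)
warped_centers = vtln_warp(centers, alpha=0.9)
```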
In this study, we focus on E2E models for the recognition performance advantages they provide over hybrid models [20] and investigate well-known approaches from hybrid modeling for their potential in E2E children’s speech recognition. We compare the effectiveness of VTLN and two specifically chosen data augmentation techniques: Speed Perturbations (SP) [21] and Spectral Augmentation (SpecAug) [22]. Speed perturbation is chosen because it can mimic a child’s higher pitch and slower speaking rate compared to adults: increasing the speed of adults’ speech raises the pitch, while lowering the speed mimics children’s slower speaking style. Speed perturbation might thus mimic aspects of children’s speech. Spectral augmentation was chosen as it has often been found to make ASR systems more robust to non-read speech [23]. Research has shown that not all data augmentation techniques work for all types of diverse speech [23]; we therefore use augmentation approaches that are common in E2E systems, such as SP, rather than specific pitch or frequency modifications, in order to further understand the limitations and possibilities of existing data augmentation techniques for diverse speech. We study the effect of the augmentations (SP and SpecAug) and normalization (VTLN) separately and together. The augmentation and normalization techniques are evaluated on both children’s and adults’ speech, with the goal of improving children’s speech recognition performance while maintaining performance on adults’ speech when adapting the model to children’s speech.
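To illustrate the two chosen augmentations, the following sketch applies speed perturbation by resampling the waveform (changing pitch and duration together, as in the common sox-based recipe) and SpecAugment-style frequency and time masking on a log-mel spectrogram; the masking parameters and perturbation factors are illustrative defaults, not the exact settings used in our experiments:

```python
import numpy as np
import librosa

def speed_perturb(y, sr, factor):
    """Sox-style speed perturbation: resample so that, when the output is
    treated as being sampled at the original rate sr, it plays `factor`
    times faster, raising pitch and shortening duration together."""
    return librosa.resample(y, orig_sr=sr, target_sr=int(sr / factor))

def spec_augment(log_mel, num_freq_masks=2, max_f=8, num_time_masks=2,
                 max_t=40, rng=None):
    """SpecAugment-style masking: zero out random frequency bands and time
    spans of a (num_mels, num_frames) log-mel spectrogram."""
    rng = rng or np.random.default_rng()
    spec = log_mel.copy()
    n_mels, n_frames = spec.shape
    for _ in range(num_freq_masks):
        f = int(rng.integers(0, max_f + 1))           # mask width in mel bins
        f0 = int(rng.integers(0, max(1, n_mels - f)))
        spec[f0:f0 + f, :] = 0.0
    for _ in range(num_time_masks):
        t = int(rng.integers(0, max_t + 1))           # mask width in frames
        t0 = int(rng.integers(0, max(1, n_frames - t)))
        spec[:, t0:t0 + t] = 0.0
    return spec

# Usage ("adult_utterance.wav" is a placeholder path).
y, sr = librosa.load("adult_utterance.wav", sr=16000)
for factor in (0.9, 1.0, 1.1):                        # common 3-way SP recipe
    y_sp = speed_perturb(y, sr, factor)               # treated as sr afterwards
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y_sp, sr=sr, n_mels=80))
    mel_aug = spec_augment(mel)
```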
Summarizing, our contributions in this work are: (1) We assess the effectiveness and language independence of the data augmentation and VTLN approaches within E2E systems for three distinct languages: Dutch and German, two closely related Germanic languages, and Mandarin, an unrelated Asian tone language. (2) We analyze the effects of data augmentation and VTLN for different child age groups and gender categories. (3) Previous work on E2E models for adults’ speech recognition showed that a VTLN filter-bank front-end provides better performance than the original filter-bank features [24]. VTLN’s potential benefits in the context of E2E children’s speech recognition are explored here for the first time. (4) Whereas VTLN models are typically trained on the same adults’ or children’s speech data as the ASR model, we explore various types of training data for VTLN model training. Moreover, we assess the warping factors and the effectiveness of using adults’ and/or children’s speech as VTLN training data, as well as monolingual versus multilingual, multi-speaker speech data for VTLN training. (5) VTLN can be applied during training and testing or during testing alone, with potentially different results [25]. We investigate the effect of applying VTLN during training and testing versus only during testing for E2E children’s speech recognition across different languages and speech styles. Our work can thus be considered a baseline or benchmark in E2E modeling using VTLN for children’s speech recognition in the absence of children’s speech data for acoustic model and language model training, as it provides comprehensive results and comparisons across different languages and age groups.
5. General Discussion and Conclusions
In this study, we investigated data augmentation (speed perturbations and spectral augmentation) and feature normalization techniques (vocal tract length normalization) for E2E children’s speech recognition in the scenario in which no children’s speech or text data are available for (re)training the ASR system. We investigated the effect of these three approaches in isolation and together across three different languages, different speaking styles, and different children/teenager age groups, and compared the results to those for adults’ speech. For these languages, the baseline ASR models trained on adults’ speech achieved a WER of <10% on adults’ speech, were found to be close to or even better than state-of-the-art results for the respective data sets/languages, and outperformed the state-of-the-art OpenAI-Whisper small and medium models (and even the large models for Dutch and Mandarin). However, performance deteriorated substantially when tested on children’s speech. For Dutch, a drop of 30% absolute was observed for read speech and of over 40% absolute for the more spontaneous human–machine interaction speech. For German continuous speech, the deterioration was close to 70% absolute. For Mandarin, the picture was slightly different. Here, there was no performance drop from adult read speech to children’s read speech; however, a drop of around 16% absolute (around 50% relative) was observed for the children’s speech data set that consisted of spontaneous speech, which also included younger children than in the read speech set.
The lack of a performance drop from adults’ to children’s read speech for Mandarin is likely at least partially explained by the fact that for Mandarin, the adults’ and children’s speech sets are part of the same corpus, with the same recording conditions, while for Dutch and German the adults’ and children’s speech came from separate databases. To assess the impact of database mismatch, we tested the SLT Mandarin SP + SpecAug model (without LM) on three other Mandarin read speech databases: Magic data [75] (43 k utterances, 52 h, 78 speakers), Aishell [76] (7 k utterances, 10 h, 20 speakers), and THCHS-30 [77] (2 k utterances, 6.3 h, 10 speakers), and we obtained 2.48%, 4.10%, and 14.25% CER, respectively. The results for Magic data and Aishell are good despite the database mismatch, suggesting that the relatively easy recognition task of read speech may counterbalance the effect of database mismatch, which is in line with the results for our adults’ and children’s Mandarin read speech. However, THCHS-30, which consists of longer utterances, shows a drop in performance, indicating that the impact of database mismatch (also) depends on the specific database characteristics. Given that for both Dutch and German the children’s speech is partially or entirely non-read speech, the observed drop in recognition performance from adults’ to children’s speech is only partially explained by the database mismatch and is thus also due to the acoustic differences between adults’ and children’s speech.
Similar to what has been observed before for a Mandarin E2E system for children’s speech [18], applying speed perturbations reduced the WERs for children’s speech recognition. Performance was further improved when SpecAug was added. The beneficial effect of SpecAug is in line with findings for English children’s speech, which showed an improvement when SpecAug was applied compared to when it was not [78]. Our results confirm and extend these earlier findings with a few observations: we observed improvements when using speed-perturbed adults’ speech for children’s speech recognition. We attribute this to the pitch and speed changes caused by the speed perturbations, which make the adults’ speech more similar to children’s speech. However, this positive effect of adding perturbed adults’ speech was only observed for native speakers and was absent for Dutch non-native speakers. Applying SpecAug led to performance improvements for all speaker groups, with a more substantial impact on non-read speech types. This emphasizes that augmentation techniques do not always and uniformly enhance performance but rather depend on specific characteristics of the speaker group and speech type, which is in line with the findings of [23].
As far as we are aware, we are the first to apply VTLN to adults’ speech for the improvement of children’s speech recognition in E2E models. Our results showed that the application of VTLN improved children’s speech recognition across the board, both with VTLN models trained on adults’ speech only and with VTLN models trained on children’s speech (from the same database as the test data) only; however, the improvement was smaller than for the combined SP and SpecAug data augmentation methods. The combination of SP, SpecAug, and VTLN, however, gave the best children’s speech recognition results for all three languages. Similar to what has been found for hybrid models [50], VTLN, even when trained on adults’ speech only, thus also improves the recognition performance of children’s speech in E2E models without any language model, in the absence of children’s speech training data. This result shows not only that VTLN provides a complementary approach and improvement to data augmentation but also that the same approach can be used across languages to improve children’s speech recognition. Moreover, since we tested different types of speech (read, HMI, and spontaneous speech), these results show that the combined approach also generalizes over speech styles. Importantly, the performance on adults’ speech was maintained. Thus, reducing the spectral variation resulting from vocal tract length differences, which is particularly relevant to children’s speech, does not impact performance on adults’ speech recognition.
In our experiments, we trained different VTLN models using adults’ and children’s speech from native and non-native speakers (Dutch only) and from three different languages. Each VTLN model exhibited variations in estimated warping factors, impacting ASR performance to varying degrees. Notably, when the VTLN model estimated warping factors that were distinct for adults and children, this generally led to improved recognition performance, particularly for younger children. In line with our scenario of no children’s speech and text data, we trained a VTLN model on adults’ speech only, which showed significant improvements over the baseline for all native children and teenager speaker groups in all three languages, both when applied in isolation and in combination with SP and SpecAug. Not surprisingly, training the VTLN model (also) on children’s speech further improved performance. This is as expected, as the VTLN model was then trained on the same database as the children’s test speech, thus reducing database mismatch and providing the target speech for VTLN model training. This scenario is nevertheless realistic: although both speech and transcribed text are often unavailable for acoustic model and language model training, children’s speech audio alone is more readily available. Using a VTLN model trained with in-domain children’s speech is thus likely the best solution; however, a VTLN model trained with adults’ or any other children’s speech is a good alternative. This is in line with our previous findings [67], which indicated that VTLN models trained on Dutch improved the performance of Mandarin Chinese children’s speech recognition, demonstrating the generalizability of the VTLN warping factors across languages. Overall, the best results were obtained when the VTLN model was trained on data consisting of speech from all three languages, all ages, and all speech types. This shows that the more variable the training data are, the better the VTLN warping factors are estimated, resulting in improved recognition performance for children’s speech.
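For intuition on how warping factors can be estimated from a given set of speakers, a common recipe known from hybrid systems (the exact procedure used in this work may differ) is a maximum-likelihood grid search: warp each speaker’s features with candidate factors and keep the factor that scores highest under a reference acoustic model, such as a GMM trained on unwarped speech. A minimal sketch under those assumptions, where extract_warped_features is a hypothetical helper computing filterbank features on a warped frequency axis:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def estimate_warp_factor(speaker_utts, gmm: GaussianMixture,
                         alphas=np.arange(0.88, 1.13, 0.02)):
    """Return the warping factor whose warped features are most likely
    under a reference GMM (e.g., trained on unwarped adults' features)."""
    def avg_loglik(alpha):
        # Hypothetical helper: filterbank features on an alpha-warped axis.
        feats = extract_warped_features(speaker_utts, alpha)
        return gmm.score(feats)  # mean log-likelihood per frame
    return max(alphas, key=avg_loglik)

# A reference model could be, e.g., a diagonal-covariance GMM:
# gmm = GaussianMixture(n_components=64, covariance_type="diag").fit(adult_feats)
```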
The impact of VTLN varied depending on where it was applied in the automatic speech recognition process. We explored its effects when applied during both training and testing and when applied only during testing. Using VTLN during training and testing is only possible when the model can be retrained, which is not always the case. The results show that applying VTLN only during testing gave improvements over the baseline results for all languages. Thus, even when a model cannot be retrained, applying VTLN will help children’s speech recognition performance. For Dutch, applying VTLN both during training and testing gave the best results, while for German and Mandarin this condition gave slightly worse results than the test-only condition. The difference between the languages is that for Dutch, the VTLN model was trained on adults’ speech with a wide variety of speech styles (including read speech, lecture recordings, broadcast data, and spontaneous conversations), while for German and Mandarin only read adults’ speech was used. The results of Experiment 3 showed that the VTLN model trained on a variety of languages, speech styles, and age groups outperformed the VTLN models trained with less diverse data. Likewise, we hypothesize that the more diverse adult Dutch training data for VTLN model training yielded better warping factors than the less diverse adults’ speech data for German and Mandarin. This led to better-normalized features, which could be learned during training, while the same normalized features were available during testing, leading to a matched train–test scenario and improved recognition performance. Importantly, the performance for adults’ speech does not degrade when VTLN is applied.
The age and gender analyses on Dutch children’s and teenagers’ speech showed that WERs are higher for younger children and gradually level off with age, as shown in earlier studies using hybrid ASR systems [6]. Although the use of VTLN maintained this trend, it improved recognition performance for all ages, with larger improvements for younger children than for teenagers. For Dutch, female speech was consistently recognized better than male speech, in line with previous findings for this database [27]. The application of VTLN gave very similar improvements for both genders in the database.
Both speed perturbations and spectral augmentation are often used as data augmentation techniques in E2E systems and have shown their effectiveness in improving recognition performance for adults’ speech [21,22], despite the fact that both methods can potentially introduce artifacts in the generated speech signal and acoustic features, respectively. Speed perturbation, for instance, alters the speech signal’s pitch and speed, which occasionally leads to unnatural or distorted sounds (as shown by other experiments in our lab). Spectral augmentation modifies spectral characteristics; however, it is not known which spectral information is modified, so the model may also be learning artificial patterns. In this work, we did not check for these artifacts, nor did we try to optimize the parameter settings of these two methods; we used the standard settings. The results shown in this paper, however, indicate that the benefit of applying SP and SpecAug is larger than the negative effect of potential artifacts. Future research could investigate whether further performance gains can be obtained when the parameter settings are tuned to the task at hand and artifacts are removed. Regarding VTLN: while our study highlighted VTLN’s impact across different languages, its applicability and integration in E2E models may encounter challenges. For instance, because VTLN needs to be trained independently and then used as a processing step after feature extraction to warp the features for training the ASR network architecture, it may not be compatible with architectures that use raw waveform input rather than features. As a result, integrating VTLN into such architectures requires further exploration. In the future, we intend to explore the performance of existing pre-trained models, such as Whisper, for these languages as an alternative to the baseline model without augmentations or to the model trained using data augmentations. By doing so, we aim to investigate whether VTLN still offers complementary information when employed with pre-trained models that are already trained on diverse speech types and diverse speaker groups. While retraining these pre-trained models is not always feasible or desirable for computational reasons, using VTLN only during testing could potentially enhance the recognition performance of pre-trained models without extensive retraining, with the ultimate aim of removing bias against children’s speech in automatic speech recognition.
In conclusion, this research contributes to narrowing the performance gap between children’s and adults’ speech recognition, especially when children’s speech and text data are absent for training. By training our VTLN model on adults’ speech and applying state-of-the-art speed perturbation and spectral augmentation techniques to adults’ speech, we improved recognition performance across diverse child speaker groups, speaking styles, and languages, thus showing that these approaches generalize across age, speaking style, and language. Performance was further improved when children’s speech and/or highly variable speech was used to train the VTLN model. These findings highlight the potential for enhancing End-to-End children’s speech recognition performance by (1) applying to adults’ speech state-of-the-art techniques that have shown their effectiveness in adults’ speech ASR (the data augmentation techniques) and in hybrid ASR models (VTLN), and (2) strategically taking into account the availability of data and the feasibility of training methods to improve children’s speech recognition results in the absence of children’s speech and text data for training ASR models. These findings enable the development of more accessible and inclusive children’s speech technology applications.