2.1. Low-Resource Speech Recognition
The Karelian language belongs to the Balto-Finnic group of the Uralic language family. Linguists distinguish three main dialects of Karelian: Karelian Proper, Livvi-Karelian, and Luudi-Karelian [
13]. It is worth mentioning, however, that Luudi-Karelian is treated as a separate language (Ludian) in some works [
14]. Since today Livvi-Karelian is the most widespread dialect of Karelian [
15], being widely represented in the Karelian media, the authors of this paper only focused on Livvi-Karelian data.
Livvi-Karelian falls within the category of “low-resource languages”. The term “low-resource languages” (or “under-resourced languages”) refers to languages with a limited number of electronic resources available. This term was first introduced in [
16,
17]. A set of criteria was proposed for classifying a language as low-resource, including the existence of a writing system, the availability of data on the Internet, descriptive grammars, electronic bilingual dictionaries, parallel corpora, and others. In subsequent works [
18], the notion of low-resource languages was further expanded to consider factors such as the low social status of a language and the limited extent of its study. Nowadays, however, the main criterion for classifying a language as low-resource is the scarcity of electronic data available to researchers [
19].
Low-resource languages are significant not only for linguists but also due to their role as means of communication in many societies. Currently, there exist about 2000 low-resource languages spoken by more than 2.5 billion people in Africa and India alone. Developing tools for natural communication with speakers of these languages can help address a wide range of economic, cultural, and environmental issues.
The scarcity of language data is a complex problem that impacts various aspects of language processing: phonetic, lexical, and grammatical [
20]. In simple terms, the lack of data hampers the direct application of “classical” approaches to automatic speech recognition and translation, which usually imply the use of acoustic, lexical, and grammatical (language) models. In this “standard” approach, an automatic speech recognition (ASR) system consists of an acoustic model (AM) that establishes the relationship between acoustic information and allophones of the language in question [
21], a language model (LM) necessary for building hypotheses of a recognized utterance, and a vocabulary of lexical units with phonetic transcriptions. The training of acoustic models involves utilizing a speech corpus, while the development of the language model draws upon probabilistic modeling using available target language texts (as illustrated in
Figure 1).
A speech recognition system operates in two modes: training and recognition. In the training mode, acoustic and language models are created, and a vocabulary of lexical units with transcriptions is built up. In the recognition mode, the input speech signal is converted into a sequence of feature vectors, and the most probable hypothesis is found using pretrained acoustic and language models [
22]. For this purpose, the maximum probability criterion is employed:

$$\hat{W} = \underset{W}{\arg\max}\; P(O \mid W)\, P(W),$$

where $O$ represents a sequence of feature vectors extracted from the speech signal and $W$ ranges over all potential sequences of words. The probability $P(O \mid W)$ is calculated with the AM, while the probability $P(W)$ is derived through the LM. Hidden Markov models (HMMs) can be used as the AM, with each acoustic unit being modeled by one HMM, typically with three states. In this case, the acoustic probability is computed using the following formula [23]:

$$P(O \mid W) = \sum_{q} \pi(q_0) \prod_{t=1}^{T} a_{q_{t-1} q_t}\, P(O_t \mid q_t),$$

where $q = (q_0, q_1, \ldots, q_T)$ is a sequence of HMM states, $\pi(q_0)$ and $a_{q_{t-1} q_t}$ are the initial state probability and state transition probability, respectively, determined by the HMM, and $q_t$ is the HMM state at time $t$.
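To make the decoding criterion concrete, the following minimal Python sketch scores a handful of candidate word sequences by combining acoustic and language model log-probabilities. The function names and the language-model weight are illustrative assumptions and do not correspond to any specific toolkit.

```python
import numpy as np

def best_hypothesis(hypotheses, am_logprob, lm_logprob, lm_weight=1.0):
    """Return the word sequence W maximizing log P(O|W) + lm_weight * log P(W).

    `am_logprob` and `lm_logprob` are hypothetical callables that return
    log-probabilities for a given word sequence; `lm_weight` is the usual
    language-model scaling factor applied in practice.
    """
    scores = [am_logprob(w) + lm_weight * lm_logprob(w) for w in hypotheses]
    return hypotheses[int(np.argmax(scores))]
```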
Nowadays, deep neural networks (DNNs) are widely used for training both the acoustic and language models. For acoustic models, DNNs are combined with HMMs, forming the so-called hybrid DNN/HMM models. In this case, DNNs are employed to derive the posterior probabilities of HMM states, wherein HMMs capture long-term dependencies and DNNs contribute discriminative training capabilities. At the decoding stage, the posterior probability $P(q_t \mid O_t)$ produced by the DNN should be converted to the likelihood [23]:

$$P(O_t \mid q_t) = \frac{P(q_t \mid O_t)\, P(O_t)}{P(q_t)},$$

where $P(O_t)$ is independent of the word sequence and can therefore be ignored. Thus, the pseudo-likelihood is used for decoding:

$$\tilde{P}(O_t \mid q_t) = \frac{P(q_t \mid O_t)}{P(q_t)}.$$
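As a rough illustration of this conversion, the sketch below turns frame-level DNN posteriors into scaled log-likelihoods by subtracting the log state priors (priors are typically estimated from state-level alignment counts over the training data). The array shapes and variable names are assumptions made for the example.

```python
import numpy as np

def posteriors_to_pseudo_likelihoods(posteriors, state_priors, eps=1e-10):
    """Convert DNN outputs P(q_t | O_t) of shape (frames, states) into
    log pseudo-likelihoods log P(q_t | O_t) - log P(q_t) used for decoding."""
    log_post = np.log(posteriors + eps)
    log_prior = np.log(state_priors + eps)   # shape: (states,)
    return log_post - log_prior              # broadcast over all frames
```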
LMs are typically developed using either a statistical n-gram methodology or a recurrent NN (RNN) approach. In an RNN, the hidden layer stores the entire preceding history, in contrast to feedforward NNs, which can store only a context of restricted length.
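For illustration, a bigram LM with add-one smoothing can be estimated from a small tokenized corpus as in the toy sketch below; real systems use higher-order n-grams with more sophisticated smoothing, or RNN LMs, and the example tokens are placeholders.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Estimate add-one-smoothed bigram probabilities P(w_i | w_{i-1}) from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    vocab_size = len(set(unigrams) | {"</s>"})
    def prob(prev, word):
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return prob

# toy usage with placeholder tokens
lm = train_bigram_lm([["hyviä", "päiviä"], ["hyviä", "uutisia"]])
print(lm("hyviä", "päiviä"))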
The main methods for acoustic and language modeling are summarized in
Table 1.
Despite the recent widespread use of the end-to-end [
24] approach to speech recognition, the standard approach remains the preferred choice for low-resource languages because it requires less training data. For example, the hybrid DNN/HMM approach was used in [
25] for speech recognition in the low-resource Sinhala language. The results obtained by the authors show that hybrid DNN/HMM acoustic models outperform HMMs based on Gaussian mixture models (GMM) by 7.48 in terms of word error rate (WER) on the test dataset.
In [
26], the results of experiments on multilingual speech recognition of low-resource languages (10 languages from the set proposed as part of the OpenASR20 contest, as well as the North American Cree and Inuit languages) were presented. The authors experimented with factorized time delay neural networks (TDNN-F) in hybrid DNN/HMM acoustic models and showed that this architecture outperforms long short-term memory (LSTM) neural networks (NNs) in terms of WER. A similar conclusion was made in [
27] for the Somali language data.
In a range of studies addressing the Russian language, it has also been shown that hybrid acoustic models based on TDNNs are superior to HMMs and to other hybrid DNN/HMM models [
28,
29].
Based on these examples, the authors of this paper decided to adopt the standard approach in developing their speech recognition system for Livvi-Karelian, and to choose hybrid DNN/HMM acoustic models.
2.2. Speech Data Augmentation: Main Approaches
As previously mentioned, one of the most important prerequisites for ASR system development is the availability of training data (audio and text corpora). This holds particular significance for the Karelian language. An effective approach to the data scarcity problem is data augmentation. Data augmentation refers to a set of methods used to create additional data for training models, either by modifying existing data or by adding data from external databases. It is well known that augmentation techniques can help solve overfitting problems and improve the performance of ASR systems [
30]. By employing such augmentations, the dataset is effectively expanded through numerous variations of the input data. The following methods of data augmentation can be distinguished: speech signal modification, spectrogram modification, and data generation.
2.2.1. Speech Signal Modification
Speech signal modification can be performed by changing voice pitch, speech rate, and speech volume, adding noise, and modifying features extracted from the speech signal. An illustrative example of augmentation through speech signal modification is presented in [
31], where the authors changed speech rate by multiplying the original speed by coefficients of 0.9, 1.0, and 1.1. The effect of these transformations was a 4.3% reduction in WER. Augmentation by adding random values to speech features is presented in [
32]. Some researchers combine several types of augmentation. For example, a two-stage speech data augmentation is proposed in [
33]. In the first stage, random noise was added and speech rate was modified in order to enhance the robustness of acoustic models. In the second stage, feature augmentation was performed on the adapted speaker-specific features.
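A minimal sketch of such signal-level augmentation is given below, assuming the librosa and numpy libraries are available. The rate factors mirror the 0.9/1.0/1.1 values mentioned above, while the noise level and gain are illustrative choices.

```python
import numpy as np
import librosa

def augment_waveform(wav, sr, rate=1.1, noise_std=0.005, gain=1.2):
    """Produce one augmented copy of a waveform: change speech rate,
    scale the volume, and add low-level Gaussian noise."""
    stretched = librosa.effects.time_stretch(wav, rate=rate)  # rate > 1 speeds up speech
    noisy = gain * stretched + np.random.normal(0.0, noise_std, size=stretched.shape)
    return np.clip(noisy, -1.0, 1.0)

# example: three rate-perturbed copies, as in the work cited above
# wav, sr = librosa.load("utterance.wav", sr=16000)
# copies = [augment_waveform(wav, sr, rate=r) for r in (0.9, 1.0, 1.1)]
```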
Voice conversion technology can also be classified as an augmentation technique, involving modification of the speaker’s voice (source voice) so that it sounds like another speaker’s (target voice) while linguistic features of the speech remain unchanged [
34]. Generative Adversarial Networks (GANs) are commonly used for this purpose [
35], along with their variants, such as Wasserstein GAN and StarGAN. In [
36], a method called VAW-GAN is proposed for non-parallel voice conversion, combining a conditional variational autoencoder (C-VAE) and Wasserstein GAN. The former models the acoustic features of each speaker; the latter synthesizes the voice of another speaker. The StarGAN architecture is used in [
37], which presents the StarGAN-VC method for voice conversion. Several types of data augmentation were applied in the work [
38] for Turkish speech recognition. The authors explored different augmentation techniques, such as speech rate modification, volume modification, joint modification of speech rate and volume, and speech synthesis (investigating Google’s text-to-speech system and an integrated system for synthesizing Turkish speech based on deep convolutional networks). Additionally, various combinations of the described methods were employed. The best result was achieved by jointly applying all methods, resulting in a 14.8% reduction in WER.
Mixup [
39] is another speech augmentation method that creates new training samples by linear interpolation between pairs of samples and their corresponding labels. Mixup generates a new training sample by taking a weighted average of the input features and labels of two randomly selected samples. This encourages the model to learn from the combined characteristics of multiple samples, leading to better generalization. Mixup has been widely used in image classification tasks but can also be effectively applied to other domains, such as audio processing.
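The interpolation at the heart of Mixup can be sketched in a few lines, assuming feature arrays of equal shape and one-hot label vectors; the Beta-distribution parameter alpha is an illustrative choice.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Create one Mixup sample: a convex combination of two inputs and their one-hot labels."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x_new = lam * x1 + (1.0 - lam) * x2
    y_new = lam * y1 + (1.0 - lam) * y2
    return x_new, y_new
```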
The SamplePairing technique [
40] involves pairing samples from the training set in random order. For each pair, the features are combined by averaging, in a way similar to the Mixup technique. The label of the second sample in each pair is typically ignored during training, so the model learns to predict the first sample's label from the mixed input. This method enhances robustness by exposing the model to diverse combinations of samples.
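For contrast with Mixup, a SamplePairing-style operation under the same assumptions simply averages the two inputs and keeps a single label, as in this brief sketch:

```python
import numpy as np

def sample_pairing(x1, y1, x2):
    """SamplePairing-style augmentation: average two inputs, keep the first sample's label."""
    return 0.5 * (x1 + x2), y1
```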
Mixup with label-preserving methods is another augmentation approach [
41] that extends Mixup by incorporating label preservation: modifications are applied to the input features while the original labels are kept unchanged. As a result, the model learns invariant features while maintaining the correct class assignments.
2.2.2. Spectrogram Modification
SpecAugment [
42] operates on the spectrogram representation of audio signals; the main idea behind this technique is applying a range of random transformations to the spectrogram during training. These transformations include time warping, frequency masking, and time masking.
Time warping in SpecAugment involves stretching or compressing different segments of the spectrogram in the time domain. It introduces local temporal distortions by warping the time axis of the spectrogram. This transformation helps the ASR model handle variations in speech speed, allowing it to be more robust to different speaking rates exhibited by different speakers.
In practice, the time warping transformation selects anchor points along the time axis of the spectrogram and warps the regions between them. The anchor points are randomly chosen, and the regions between them are stretched or compressed by a certain factor, which introduces local temporal distortions simulating variations in speech speed.
The frequency masking transformation masks consecutive frequency bands in the spectrogram; by randomly masking out a range of frequencies, the model becomes more robust to variations in pitch and speaker characteristics. The time masking transformation masks consecutive time steps in the spectrogram; by randomly masking out segments of the audio signal, the model learns to be invariant to short-term temporal variations.
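A simplified numpy sketch of frequency and time masking is shown below; the mask-width parameters F and T are illustrative and would need tuning for a real setup, and time warping is omitted for brevity.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, F=15, num_time_masks=2, T=40, rng=None):
    """Apply simple frequency and time masking to a (freq_bins, time_steps) spectrogram."""
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    n_freq, n_time = spec.shape
    for _ in range(num_freq_masks):
        f = rng.integers(0, F + 1)                 # mask width along frequency
        f0 = rng.integers(0, max(1, n_freq - f))   # mask start
        spec[f0:f0 + f, :] = 0.0
    for _ in range(num_time_masks):
        t = rng.integers(0, T + 1)                 # mask width along time
        t0 = rng.integers(0, max(1, n_time - t))
        spec[:, t0:t0 + t] = 0.0
    return spec
```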
Vocal Tract Length Perturbation (VTLP) [
43] is a method of spectrogram transformation that applies a random linear distortion along the frequency axis. The main idea is to use normalization not to remove variation but, on the contrary, to add variation to the audio; this can be achieved by normalizing to an arbitrary target instead of a canonical mean. For VTLP, a warp factor is generated for each sample, and the frequency axis is warped so that each frequency f is mapped to a new frequency f’. VTLP is typically applied by modifying speech features such as mel-frequency cepstral coefficients (MFCCs) or linear predictive coding (LPC) coefficients. The perturbation is achieved by scaling the feature vectors along the frequency axis, mimicking the effects of different vocal tract lengths on the spectral envelope of the speech signal.
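One common piecewise-linear formulation of this frequency warping can be sketched as follows; the boundary frequency, sampling rate, and warp-factor range are illustrative assumptions rather than values from the cited work.

```python
import numpy as np

def vtlp_warp(freqs, alpha, sr=16000, f_hi=4800.0):
    """Map original frequencies to warped ones: f' = alpha * f below a boundary,
    with a compensating linear segment up to the Nyquist frequency."""
    nyq = sr / 2.0
    boundary = f_hi * min(alpha, 1.0) / alpha
    upper_slope = (nyq - f_hi * min(alpha, 1.0)) / (nyq - boundary)
    return np.where(freqs <= boundary,
                    alpha * freqs,
                    nyq - upper_slope * (nyq - freqs))

# example: warp mel filterbank centre frequencies with a random factor in [0.9, 1.1]
# alpha = np.random.uniform(0.9, 1.1)
# warped_centres = vtlp_warp(mel_centre_freqs, alpha)
```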
2.2.3. Data Generation
Another method of speech data augmentation is speech synthesis. Among the most successful recent models are WaveGAN and SpecGAN [
44,
45]. The main difference between WaveGAN and SpecGAN lies in the domain in which they generate audio data. WaveGAN is a GAN used for the synthesis of high-fidelity raw audio waveforms. WaveGAN consists of two main modules: a generator and a discriminator. The generator network takes random noise as input and generates synthetic audio waveforms. The discriminator network tries to distinguish between the real audio samples from the dataset and the generated samples from the generator. In order to capture the temporal dependencies WaveGAN uses a convolutional neural network (CNN) architecture for both the generator and discriminator networks. The generator progressively upsamples the noise input using transposed convolutions to generate longer waveforms, while the discriminator performs convolutions to analyze and classify the input waveforms.
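As a rough structural illustration (not the original WaveGAN configuration), the following PyTorch sketch shows a generator that upsamples a noise vector into a short waveform with transposed 1-D convolutions; the layer sizes and kernel parameters are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

class TinyWaveGANGenerator(nn.Module):
    """A heavily simplified WaveGAN-style generator: noise -> raw waveform
    via stacked transposed 1-D convolutions (layer sizes are illustrative only)."""
    def __init__(self, latent_dim=100):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 256 * 16)
        self.net = nn.Sequential(
            nn.ConvTranspose1d(256, 128, kernel_size=25, stride=4, padding=11, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(128, 64, kernel_size=25, stride=4, padding=11, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(64, 1, kernel_size=25, stride=4, padding=11, output_padding=1),
            nn.Tanh(),  # waveform samples in [-1, 1]
        )

    def forward(self, z):
        # project the noise vector and reshape to (batch, channels, time) before upsampling
        x = self.fc(z).view(z.size(0), 256, 16)
        return self.net(x)

# example: generate a batch of 4 short synthetic waveforms from random noise
# z = torch.randn(4, 100)
# fake_audio = TinyWaveGANGenerator()(z)   # shape (4, 1, 1024)
```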
SpecGAN, also known as Spectrogram GAN, is a GAN architecture specifically designed for the generation of audio spectrograms. It generates audio by synthesizing spectrograms and is trained on spectrograms extracted from real audio data.
The Tacotron 2 generative NN model [
46] developed by Google is used for speech synthesis. For instance, this method was used in the work [
47] for synthesizing child speech in the Punjabi language. Furthermore, in this study, augmentation was achieved by modifying formants in the speech recordings of an adult speech corpus. In [
48], speech synthesis was employed for augmenting speech data in the development of an integrated speech recognition system, significantly reducing WER. Additionally, SpecAugment was applied, resulting in further WER reduction. A drawback of this method is a requirement for training speech data. Insufficient training data may result in unsatisfactory synthesized speech quality. An illustrative case is [
49], where the incorporation of synthesized data during the training of acoustic models failed to enhance recognition accuracy. In their work, the researchers employed statistical parametric speech synthesis techniques. Despite their efforts, they encountered challenges in achieving satisfactory quality when synthesizing speech through Tacotron 2 and WGANSing models (a speech synthesizer utilizing GANs). The authors attributed the poor synthesis quality to a lack of training data.
The main approaches to speech data augmentation are presented in
Table 2:
Overall, the best augmentation technique or combination of techniques depends on the specific ASR task, the available training data, and the desired improvements in performance. It is often beneficial to experiment with multiple techniques and assess their impact on the ASR system’s performance. For example, speech signal modification must not render the speech data linguistically implausible. Many voice conversion methods require parallel training data, in which the source and target voices are aligned, which may limit their application in some scenarios; moreover, the quality of voice conversion may vary depending on the training data and the similarity between the source and target speakers. Spectrogram modification usually requires fine-tuning to strike a balance between data diversity and maintaining speech quality. The effectiveness of the Mixup and SamplePairing techniques, as well as of data generation, may vary drastically depending on the dataset, potentially resulting in unsatisfactory speech quality.
Concluding the review of related work, it should be noted that DNN-based ASR systems are typically trained on tens or hundreds of hours of speech data. The current research aimed to investigate the feasibility of training DNN models for an ASR system on very limited training data, approximately 3 h of speech, and to explore data augmentation methods to enhance speech recognition results. Moreover, despite a long-standing literary tradition and the current interest of linguists in the language and folklore of the Karelians, Karelian remains an under-resourced language. The survey of related work has shown that there is no existing ASR system for the Karelian language.