Evaluation results are listed in Table 4, Table 5, and Table 6; all evaluations are computed on the Tujia language. The experimental results in Table 4 are obtained from audio data only, and their primary purpose is to determine which modeling units (phonemes or characters) are better suited to the Tujia language, so all available audio data are used. The data sizes in Table 5 are set to 1000, 1400, and 1800 sentences because the multi-modal corpus of this study contains only 2105 sentences (mainly limited by the video data).
4.1. Results of the Acoustic Model
To verify the effectiveness of ASR for the automatic labeling of a low-resource language, we conduct experiments based on state-transition modeling and sequence modeling. In state-transition modeling, we use the monophone as the acoustic unit and then extend it to a triphone model. As the low-resource language we study has no written form, we resort to IPA for labeling; therefore, in deep-learning-based sequence modeling, we use IPA characters as the modeling unit. Experimental results are shown in Table 4.
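All evaluations in this section are reported as character error rate (CER) over IPA character sequences. For reference, the following is a minimal sketch of how CER is conventionally computed via Levenshtein alignment; the function and the example strings are our own illustration, not the evaluation code used in this study.

```python
# Minimal sketch of character error rate (CER) over IPA transcriptions.
# CER = (substitutions + deletions + insertions) / reference length,
# computed with standard Levenshtein (edit-distance) alignment.

def cer(reference: str, hypothesis: str) -> float:
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: two character substitutions in a 10-character reference -> CER = 0.20
print(cer("tsɿ21mie35", "tsɿ21mie53"))
```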
The CER of triphone modeling is 0.18% to 2.23% lower than that of monophone modeling when recognizing 2000, 5000, and 6000 sentences, and is 3.15% and 1.13% higher when recognizing 3000 and 4000 sentences, respectively. This result shows that when data are relatively scarce, adjacent phonemes influence recognition accuracy, but only to a limited extent.
The CER of the Transformer-CTC model is 24.02% lower than that of the HMM-based models when recognizing 2000 sentences and 19.32% to 25.74% lower when recognizing 6000 sentences. This indicates that the advantage of Transformer-CTC over the other models does not widen substantially as data increase. For Transformer-CTC alone, however, the CER drops by 17.8% from 2000 to 6000 sentences, which shows that more data can effectively improve its performance.
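To make the sequence-modeling setup concrete, the following is an illustrative PyTorch sketch of a Transformer encoder trained with CTC loss over IPA character targets; the layer sizes, vocabulary size, and dummy batch are placeholder assumptions for illustration, not the hyperparameters used in this study.

```python
# Illustrative sketch of Transformer-CTC sequence modeling: a Transformer
# encoder over acoustic features, with CTC loss against IPA character ids.
import torch
import torch.nn as nn

class TransformerCTC(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, nhead=4, nlayers=6, vocab=64):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)          # acoustic features -> model dim
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.out = nn.Linear(d_model, vocab)              # vocab includes the CTC blank (index 0)

    def forward(self, feats):                             # feats: (batch, time, feat_dim)
        h = self.encoder(self.proj(feats))
        return self.out(h).log_softmax(dim=-1)            # (batch, time, vocab)

model = TransformerCTC()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(2, 100, 80)                           # dummy batch of 2 utterances
targets = torch.randint(1, 64, (2, 20))                   # dummy IPA character ids
log_probs = model(feats).transpose(0, 1)                  # CTCLoss expects (time, batch, vocab)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 100),
           target_lengths=torch.full((2,), 20))
loss.backward()
```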
A basic assumption of monophone modeling is that the actual pronunciation of a single phoneme is unrelated to adjacent or nearby phonemes. However, this assumption does not hold for languages with co-articulation, such as the Tujia language.
Among the initials of the Tujia language, stops and fricatives are phonemically voiceless only, yet these voiceless initials often become voiced in connected speech.
The final i of the Tujia language has two phonetic variants, i and ɿ. It is pronounced as ɿ after ts, tsh, s, and z, and as i after other initials. For instance, tsi21 is pronounced as tsɿ21, tshi53 as tshɿ53, si53 as sɿ53, and zi55 as zɿ55. The final e does not combine with k, kh, x, or ɣ; when pronounced with other initials it is preceded by an obvious medial i, whereas in zero-initial syllables it is preceded not by the medial i but by a slight glottal stop. For instance, pe35 is pronounced as pie35, me35 as mie35, and te35 as tie35. In compound words, if the initials are voiceless stops or affricates, they are influenced by the (nasal) vowels of the preceding syllables and become voiced stops and affricates. This phenomenon normally takes place in the second syllable of bi-syllabic words and does not affect aspirated voiceless stops or affricates. For instance, no53pi21 is pronounced as no53bi21, and ti55ti53 as ti55di53.
These examples show that in the Tujia language, the actual pronunciation of a phoneme is influenced by adjacent or nearby phonemes and may change with the phoneme's position. Therefore, using the central, left-adjacent, and right-adjacent phonemes as the basis for modeling and recognition can improve the recognition rate of the Tujia language. However, as the amount of data decreases, the model may become unstable and overfit, because the triphone model has many parameters relative to the limited data.
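As an illustration of context-dependent modeling, the sketch below expands a monophone sequence into triphone units in the HTK/Kaldi-style left-center+right notation; the function and the sil padding symbol are our own illustrative choices, not code from this study.

```python
# Sketch: expanding a monophone sequence into context-dependent triphone
# units ("left-center+right" notation). The sample is the syllable tsɿ21
# from the examples above; "sil" marks the utterance boundaries.

def to_triphones(phones):
    padded = ["sil"] + phones + ["sil"]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

print(to_triphones(["ts", "ɿ"]))
# ['sil-ts+ɿ', 'ts-ɿ+sil']
# With N base phonemes there are up to N**3 triphone contexts to estimate,
# which is why triphone models can overfit when training data are scarce.
```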
Experimental results show that sequence modeling outperforms state-transition modeling and is therefore better suited to the recognition of low-resource languages. Moreover, phonological changes are common in the studied language but are not obligatory; such changes are closely tied to the surrounding sentence structure and thus strongly affect frame-based phoneme modeling, making overall optimization impossible. Therefore, for the speech recognition of low-resource languages, sequence modeling shows better performance.
4.2. Results of AVSR
To explore the audiovisual fusion strategy suited to a low-resource language and to verify the effectiveness of the approach proposed in this paper, we design experiments comparing audio and visual recognition models, including a single audio or visual modality, Transformer-based AVSR "AV (TM-CTC)", LSTM-Transformer-based AVSR "AV (LSTM/TM-CTC)", and feature-fusion-based AVSR "AV (feature fusion)". In these experiments, the training set contains the speakers of the test set. Experimental results are shown in Table 5.
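To make the fusion baselines concrete, the following is an illustrative sketch of feature-level audiovisual fusion, in which time-aligned audio and lip-region features are concatenated before a shared encoder; all dimensions and the encoder configuration are assumptions for illustration, not the architecture used in this study.

```python
# Illustrative sketch of feature-level audiovisual fusion: per-frame audio
# and lip-region features are time-aligned and concatenated before a shared
# encoder, whose output is decoded with CTC as in the audio-only model.
import torch
import torch.nn as nn

class FeatureFusionAVSR(nn.Module):
    def __init__(self, audio_dim=80, video_dim=512, d_model=256, vocab=64):
        super().__init__()
        self.fuse = nn.Linear(audio_dim + video_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, audio, video):
        # audio: (batch, time, audio_dim); video: (batch, time, video_dim),
        # already resampled to a common frame rate
        fused = torch.cat([audio, video], dim=-1)
        return self.out(self.encoder(self.fuse(fused))).log_softmax(-1)

model = FeatureFusionAVSR()
logits = model(torch.randn(2, 100, 80), torch.randn(2, 100, 512))
print(logits.shape)  # torch.Size([2, 100, 64])
```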
As the experiments show, when the training set includes all of the test set's speakers, the CER of single-modality modeling (visual or audio) increases as the number of sentences in the dataset decreases. For the single modalities, the CER of video modeling is 5.2% to 8.0% lower than that of audio modeling. The CER of AVSR is 16.9% to 17.1% lower than that of the single modality. Among the AVSR approaches, the LSTM-Transformer approach proposed in this paper performs best, achieving a minimum CER of 46.9% on the low-resource language; this is 5.6% to 14.5% lower than those of the other two approaches. The Transformer-based fusion model takes second place. With scarce data, the loss of the feature-fusion approach fails to decrease, and its performance is worse than that of video modeling.
As the audio was recorded in a natural environment, noise is inevitable and interferes with the audio models. Video models, in contrast, focus on visual information and recognize speech content from the speakers' lip movements, so the influence of noise is reduced; the comparative results of the multi-modality and single modality verify this reduction. In addition, the proposed AVSR model significantly improves recognition accuracy compared with the single modality.
Figure 5 shows the AVSR performance in recognizing initials. The results show that, compared with audio-only recognition, AVSR (with visual information) achieves higher accuracy. The comparison of (a) and (b) shows that affricates are less likely to be confused with fricatives, which may be because the lip shape of affricates differs greatly from that of fricatives; combining visual and audio information helps to discriminate these two kinds of phonemes. Similarly, the accuracy of recognizing the nasals m, n, and ŋ also increases, and the nasals are less likely to be confused with fricatives and with the semivowels w and j in (c) and (d). This may be because the lip movement of nasals is continuous but slight, whereas that of fricatives and semivowels is pronounced.
In addition, the comparison of (e) and (f) shows that the results of the two methods are similar: AVSR cannot significantly reduce the confusion among aspirated stops, nasals, laterals, and voiceless fricatives. The reason may be that these phonemes sound quite different and can already be distinguished acoustically, while their lip movements vary: some change markedly and others barely change. The former raises recognition accuracy while the latter increases the confusion of some pronunciations, so the overall change is insignificant.
However, the unaspirated stop p is more likely to be confused with the aspirated stop ph, the unaspirated stop t with the aspirated stop th, and the unaspirated stop k with the aspirated stop kh. These three pairs are bilabials, blade-alveolars, and velars, respectively. Each unaspirated/aspirated pair differs only slightly in lip shape and mainly in tongue position, which is difficult to tell apart in the video modality but easier in the audio modality; introducing video information thus hampers the model's decision. Similarly, among the affricates, the unaspirated tɕ and aspirated tɕh are more likely to be confused, as are the unaspirated ts and aspirated tsh.
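The confusion analyses in Figures 5 and 6 can be reproduced by tallying a phoneme confusion matrix from aligned reference and hypothesized phonemes. The sketch below assumes the alignments have already been produced (e.g., by Levenshtein alignment of recognizer output against the reference transcription); the function and sample pairs are our own illustration.

```python
# Minimal sketch of the confusion analysis behind Figures 5 and 6: tally a
# phoneme confusion matrix from aligned (reference, hypothesis) phoneme pairs.
from collections import Counter, defaultdict

def confusion_matrix(aligned_pairs):
    counts = defaultdict(Counter)
    for ref_ph, hyp_ph in aligned_pairs:
        counts[ref_ph][hyp_ph] += 1
    return counts

# Hypothetical aligned pairs, e.g., p confused once with ph, m once with w.
pairs = [("p", "p"), ("p", "ph"), ("ts", "tsh"), ("m", "m"), ("m", "w")]
cm = confusion_matrix(pairs)
for ref, row in cm.items():
    # Per-phoneme accuracy = diagonal count / row total
    print(ref, dict(row), "acc=%.2f" % (row[ref] / sum(row.values())))
```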
Figure 6 shows AVSR's performance in recognizing finals, where the accuracy improves substantially. Compared with audio-only modeling, diphthongs are less likely to be confused with monophthongs in AVSR modeling, which may be because some diphthongs differ greatly from monophthongs in lip movement; the introduction of visual information thus greatly improves accuracy. In addition, nasal vowels are much less likely to be confused with compound vowels. Compared with initials, finals differ mainly in mouth shape and lip movement and only slightly in tongue position, so the influence of visual information is more pronounced.
4.3. Results of Speaker-Independent Experiments
To explore AVSR's generalization to different speakers, we also design speaker-independent experiments. In these experiments, the training set does not contain the speakers of the test set. The results are shown in Table 6.
With speaker differences taken into consideration, a decrease in data leads to an increase in CER. Video-only modeling performs better than audio-only modeling, and fusion modeling performs better than single-modality modeling, with the CER dropping by 6.8% to 18.8%. This performance gap is larger than in the setting that does not account for speaker differences. Among the fusion approaches, LSTM-Transformer modeling still performs best, with the model using a Transformer as both encoder and decoder taking second place. The feature-fusion model cannot be trained successfully for unseen speakers.
Figure 7 compares the experimental results of the single modality and multi-modality on overlapped speakers and unseen speakers, showing the distributions of and differences in character error rate across methods and data sizes. We use a 9:1 training-test split: of the 1800 sentences of audiovisual data, which cover 4 speakers, 1620 sentences are used for training and 180 for testing. The comparison involves training and testing two models: the first model is trained on the speech of all 4 speakers, with test results shown in Table 5; the second model is trained on the speech of only 3 speakers and tested on the speech of the fourth speaker, with test results shown in Table 6.
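The two evaluation settings can be summarized as two split strategies: a speaker-overlapped 9:1 random split and a speaker-independent split that holds out one speaker entirely. The sketch below illustrates both; the utterances list of (speaker_id, path) pairs is a hypothetical data structure, while the 1800/1620/180 counts follow the text.

```python
# Sketch of the two splits described above. `utterances` is a hypothetical
# list of (speaker_id, path) pairs covering 1800 sentences and 4 speakers.
import random

def overlapped_split(utterances, test_ratio=0.1, seed=0):
    # Speaker-overlapped: random 9:1 split, so test speakers also appear in training.
    data = utterances[:]
    random.Random(seed).shuffle(data)
    n_test = int(len(data) * test_ratio)      # 180 of 1800 utterances
    return data[n_test:], data[:n_test]       # train (1620), test (180)

def speaker_independent_split(utterances, held_out_speaker):
    # Speaker-independent: one speaker is held out of training entirely.
    train = [u for u in utterances if u[0] != held_out_speaker]
    test = [u for u in utterances if u[0] == held_out_speaker]
    return train, test                        # e.g., 3 speakers vs. the 4th
```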
Comparing the results of overlapped and unseen speakers in Figure 7, we find that the CER of both single and fusion modeling is higher when speaker differences are included (the training set does not include the test set's speakers) than when they are excluded. However, the CER of fusion modeling is clearly lower than that of single modeling in both situations, so fusion modeling achieves better overall performance. The LSTM-Transformer modeling we propose has the lowest CER, 17.2% lower than that of single modeling. Compared with the model using a Transformer as both encoder and decoder and with video-only modeling, LSTM-Transformer modeling shows a larger error-rate change as the amount of data varies; however, the fluctuation stays within a small margin, only 1.25% above the optimal result. Therefore, fusion modeling is better suited to recognizing Tujia speech, offering lower speaker dependence even with few speakers.