1. Introduction
Speech is the most basic form of human communication; it is simple and convenient to use. Language instruction is now tailored to different groups of people to meet their different needs. Language learning can generally be divided into first, second, or third language learning. Moreover, language therapy is available for people with disabilities or disorders. There are therefore various groups of language learners with different objectives. For example, (1) according to government policy, students learn their mother tongue and a second language in school; (2) working-age adults need a third language for career advancement; and (3) in some cases, patients who have undergone laryngeal surgery need to practice speaking after treatment. Those who practice language teaching can accordingly be divided into three main groups: language teachers, linguists, and language therapists.
1.1. Motivation to Solve the Pronunciation Problem of Thai Vowels
Pronunciation correction is an essential goal of language learning. Correct and clear pronunciation is crucial for ensuring that listeners comprehend the intended meaning of words. Good pronunciation makes communication more effective [1]. Proper language practice improves pronunciation skills and helps learners pronounce languages more accurately. Generally, language teachers, specialists, linguists, speech therapists, and native speakers are the people who explain and teach pronunciation to language learners using the listening method.
However, because teachers and learners work under diverse circumstances, inaccurate judgments of pronunciation by ear are conceivable. The number of available teachers is insufficient to meet the needs of all students. A teacher's job is to listen to and correct the mispronunciations of many students throughout the day; as a result, fatigue may cause them to lose focus on important details. Furthermore, learners can only practice their pronunciation in the classroom.
Because of these limitations, coupled with the present COVID-19 pandemic, online learning has become necessary. To solve these problems, an automated system for the intelligent assessment of pronunciation practice is an alternative choice. It can improve the efficiency of the learning process for both learners and teachers because learners can study at any location and at any time. Therefore, this research collected recordings of Thai vowels from various native speakers in various real-world environments, for instance, schools, cafeterias, parks, classrooms, bedrooms, and homes, to train the model.
With the advancement of information technology and increased computing capabilities, deep learning is increasingly applied to recognition tasks. Deep learning is effective in learning and classification [2], such as handwritten character recognition and speech recognition. Deep learning is an artificial intelligence (AI) technique that classifies data using a multilayered neural network. Computer-assisted language learning (CALL) and computer-assisted pronunciation training (CAPT) [3,4,5,6,7,8] have gained much attention in the field of language teaching and training. CALL and CAPT systems are widely used to improve language learning and teaching methods, and they can recognize speech using automatic speech recognition (ASR) implemented with deep learning structures. In this work, a deep learning model is applied to increase the accuracy of Thai vowel pronunciation classification in an automated system.
1.2. Contributions in Automatic Thai Vowel Pronunciation Recognition
In order to accurately recognize pronunciations, many tasks use specific techniques or special tools, for example, developing a 3D pronunciation learning system for Chinese [9], using ultrasound to study the pronunciation of consonants [10], developing an application that analyzes Thai tones [11], and applying the Praat program for phonetic analysis to demonstrate the pronunciation of vowels [12,13,14,15,16,17].
To increase the accuracy of pronunciation, linguists use the acoustic phonetics method for phonetic analysis. Praat [18] is a popular tool for acoustic phonetic analysis. It is used to display two formant frequencies, Formant 1 (F1) and Formant 2 (F2). To determine whether the pronunciation of a vowel is correct, the position of the tongue must be represented on a specific graph. This is a complex process that requires specialists. A linguist's tongue position analysis proceeds as follows: (1) record and play the vowel sound with the Praat program; (2) check and select the range of the vowel sound in the syllable to measure; (3) compute the average F1 and F2 of the selected vowels; (4) record the average values of F1 and F2 in Microsoft Excel; (5) create a vowel graph of native speakers and learners using Microsoft Excel, Python, or R; (6) compare the graphs of native speakers and learners; and (7) present and explain the incorrect pronunciations to the learners and correct them. This graph-based method for determining pronunciation accuracy is a complex, time-consuming, non-real-time process, and the user must be an expert or someone with programming knowledge. Consequently, there are few specialists in the field today.
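For readers who want to reproduce steps (1)–(5) programmatically, a minimal sketch using Parselmouth (a Python interface to Praat) and matplotlib is shown below; the file names and the selected measurement windows are hypothetical placeholders, not data from this study.

```python
# A minimal sketch of the linguist's F1/F2 workflow (steps 1-5), using
# Parselmouth (Python interface to Praat). File paths and the measured
# time windows are hypothetical placeholders.
import parselmouth
import matplotlib.pyplot as plt

def mean_f1_f2(wav_path, t_start, t_end, n_points=10):
    """Average F1 and F2 over the selected vowel interval."""
    sound = parselmouth.Sound(wav_path)
    formant = sound.to_formant_burg()  # Burg formant analysis, as in Praat
    times = [t_start + i * (t_end - t_start) / (n_points - 1)
             for i in range(n_points)]
    f1 = sum(formant.get_value_at_time(1, t) for t in times) / n_points
    f2 = sum(formant.get_value_at_time(2, t) for t in times) / n_points
    return f1, f2

# Hypothetical recordings of the same vowel by a native speaker and a learner.
native_f1, native_f2 = mean_f1_f2("native_vowel.wav", 0.10, 0.25)
learner_f1, learner_f2 = mean_f1_f2("learner_vowel.wav", 0.12, 0.27)

# Conventional vowel chart: F2 on x, F1 on y, both axes reversed.
plt.scatter([native_f2, learner_f2], [native_f1, learner_f1])
plt.annotate("native", (native_f2, native_f1))
plt.annotate("learner", (learner_f2, learner_f1))
plt.gca().invert_xaxis(); plt.gca().invert_yaxis()
plt.xlabel("F2 (Hz)"); plt.ylabel("F1 (Hz)")
plt.show()
```

A visible gap between the two points on this chart corresponds to step (6): a tongue position that differs from the native speaker's, i.e., an incorrect pronunciation.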
So far, few research works have been conducted to classify Thai vowel pronunciation with an intelligent system. From the literature review, there is no study on automatic CAPT for Thai vowels using AI with a deep learning structure. Therefore, this research aims to design and develop an automatic CAPT system using a deep learning structure for Thai vowel speech recognition. The system is developed (1) to help nonnative learners and nonstandard Thai speakers practice the pronunciation of Thai vowels; (2) to help people with pronunciation disabilities; (3) to alleviate the shortage of specialists in teaching Thai vowel pronunciation; (4) to replace the original process, which is complicated, time-consuming, and does not present results in real time; and (5) to provide a new tool for learning languages online that suits the current situation.
A deep learning structure of CAPT for Thai vowel speech is designed to recognize the 18 basic Thai vowels. Pronouncing the 18 Thai vowels is difficult because some phonemes have similar characteristics; nonnative learners therefore cannot distinguish them and require assistance from an expert. In this research, a convolutional neural network (CNN) is trained over the dataset for Thai vowel speech classification. This model trains a computer to identify learners' vowel pronunciations like an expert. The existing Thai audio corpus is not suitable for the objectives of this research; therefore, to obtain theoretically qualitative vowel data grounded in linguistic principles, a new dataset is designed and collected. The dataset used for training this model consists of voices collected across various dimensions in real-life contexts, such as gender, age, accent, environment, and noise. The major contributions of this work are as follows:
This research proposes a noisy dataset collected from standard Thai speakers across various dimensions. The dataset is designed, collected, and examined by a linguist.
This research proposes a method for Thai vowel speech recognition based on a convolutional neural network (CNN), which is one of the most well-known deep neural networks (DNNs). The CNN model is applied to automatic speech recognition on the automatic CAPT for Thai vowels. The optimal CNN model is utilized to learn the spectral characteristics of 18 Thai vowel classes.
This research generates two different acoustic feature inputs for CNN and long short-term memory (LSTM) models: Mel spectrogram (MS) and Mel-frequency cepstral coefficient (MFCC) acoustic features, both computed from the raw speech waveform to learn deep multimodal features (see the feature extraction sketch after this list of contributions).
This research proposes automatic CAPT for Thai vowel speech that can display the learners’ vowel pronunciation results. If the pronunciation is incorrect, then it will suggest the correct practice with text, real video, and 3D video.
The automatic CAPT for Thai vowel speech uses a deep learning structure. It is a new system that integrates computer techniques with linguistic theory. The system can guide learners such as the voice impaired, nonnative learners, and nonstandard Thai speakers. It therefore allows learners to practice vowel pronunciation in real time, similar to having an expert, a Thai teacher, or a linguist provide assistance on the correctness of vowel pronunciation.
The outcome of this work benefits the innovation of advanced systems, for example, the classification of tones, words, phrases, and sentences to facilitate learning standard Thai pronunciation.
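As a sketch of how the MS and MFCC acoustic feature inputs named in the third contribution can be derived from a raw waveform, the snippet below uses librosa; the file name, sample rate, band counts, and frame-fixing logic are illustrative assumptions, with the 11 × 128 MS shape following the size reported later in this paper.

```python
# A minimal sketch of extracting the two acoustic feature inputs (MS and
# MFCC) from a raw waveform with librosa. The file name and analysis
# parameters are illustrative; the paper's exact settings may differ.
import librosa
import numpy as np

y, sr = librosa.load("vowel.wav", sr=16000)   # raw speech waveform

# Mel spectrogram (MS): Mel-filterbank energies converted to dB.
ms = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
ms_db = librosa.power_to_db(ms, ref=np.max)   # shape: (128 mel bands, n frames)

# MFCC: adds a log and a DCT on top of the Mel filterbank.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)  # shape: (40 coeffs, n frames)

# The models consume fixed-size (time x frequency) inputs, e.g., 11 x 128
# for MS as reported in this paper, so frames are cropped or zero-padded.
def fix_frames(feat, n_frames=11):
    feat = feat.T                               # -> (n frames, n bands)
    if feat.shape[0] < n_frames:                # pad short clips with zeros
        pad = np.zeros((n_frames - feat.shape[0], feat.shape[1]))
        feat = np.vstack([feat, pad])
    return feat[:n_frames]

ms_input = fix_frames(ms_db)    # (11, 128) input for the CNN
mfcc_input = fix_frames(mfcc)   # (11, 40) input
```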
The remainder of the paper is organized as follows. The background is described in Section 2, related works are presented in Section 3, and materials and methods are shown in Section 4. Section 5 describes the results of the experiments, conclusions are presented in Section 6, and the discussion is presented in Section 7. Finally, the automatic computer-assisted pronunciation training for Thai vowels is presented. The definitions of variables and acronyms are shown in Appendix A (Table A1) after the last section.
5. Results
This experiment evaluates six combinations of two acoustic feature inputs and three model settings: (1) the MFCC acoustic features combined with a baseline CNN model, (2) the MS acoustic features combined with the baseline CNN model, (3) the MFCC acoustic features combined with the baseline LSTM model, (4) the MS acoustic features combined with the baseline LSTM model, (5) the MFCC acoustic features combined with the optimized CNN model, and (6) the MS acoustic features combined with the optimized CNN model.
Table 3 presents the experimental results. The parameters are set to a batch size of 32 and 500 epochs. The MS acoustic features combined with the baseline CNN model achieve the lowest accuracy, 88.89%. In the third and fourth experiments, the MFCC and MS acoustic features with the baseline LSTM model achieve low accuracies of 94.44% and 90.00%, respectively. LSTM layers are beneficial for learning long-term contextual dependencies from long sequences; here, in contrast, the LSTM is applied to the Thai vowel dataset, which consists of one-syllable words rather than long sentences, so the LSTM does not stand out in this task. The optimized CNN model combined with MS acoustic features achieves an improved accuracy of 98.61%. This result shows that the MS acoustic features perform best for Thai vowel classification.
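For concreteness, the sketch below shows how the two baseline heads compared in Table 3 might be constructed in Keras; the paper's exact layer configurations and hyperparameters are not reproduced here, so the layer sizes are assumptions for demonstration only.

```python
# Illustrative Keras baselines for 18-class Thai vowel classification.
# Layer sizes are assumptions; only the input shapes and the batch/epoch
# settings follow the paper.
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 18
TIME, FREQ = 11, 128          # MS input size (time x frequencies)

# Baseline 2D CNN: treats the MS as a single-channel image.
cnn = keras.Sequential([
    layers.Input(shape=(TIME, FREQ, 1)),
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Baseline LSTM: treats the MS as a sequence of 11 frames of 128 bands.
# With only ~11 frames per one-syllable word, there is little long-range
# context for the LSTM to exploit, consistent with its lower accuracy.
lstm = keras.Sequential([
    layers.Input(shape=(TIME, FREQ)),
    layers.LSTM(128),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

for model in (cnn, lstm):
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
# Training would use model.fit(x, y, batch_size=32, epochs=500) as in Table 3.
```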
The line graphs of the accuracy and loss of the optimized CNN models combined with the MFCC or MS acoustic features are presented in Figure 6a,b. The visualization compares the accuracy and loss of the training and testing models from 0 to 500 epochs. The MS acoustic features combined with the optimized CNN model outperform the optimized CNN combined with MFCC, achieving the best accuracy of 98.61%, as shown in Figure 6b. The loss values are presented in Table 3. The baseline CNN model with the MFCC or MS acoustic features obtains higher loss values and more overfitting than the other models. The optimized CNN model combined with the MFCC or MS acoustic features reduces the overfitting, as illustrated in Figure 6a,b.
For the error analysis, the confusion matrix of the optimized CNN model for Thai vowel recognition is shown in Figure 7. Regarding the misclassifications, 10 of the 18 classes have a 0% error rate. The ‘เออะ’ /ɤ/ vowel is the class with three mispredictions. In the confusion matrix, the most frequently confused Thai vowel pair is ‘เออะ’ /ɤ/ and ‘อึ’ /ɯ/. These sounds are similar, which linguistic theory explains by their shared characteristics: both pronunciations use the back part of the tongue [61]. As a result, the Thai vowel recognition model may confuse them.
Table 4 shows the precision, recall, and F1 scores of the CNN model for each Thai vowel class. The lowest F1 score on the dataset is 0.89 for ‘อึ’ /ɯ/, consistent with the confusion matrix. On the other hand, the highest F1 score (i.e., 1.00) on the dataset is achieved for ‘อิ’ /i/, ‘เออ’ /ɤ:/, ‘อี’ /i:/, ‘เอ’ /e:/, and ‘ออ’ /ɔ:/.
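The per-class metrics in Table 4 and the confusion matrix in Figure 7 can be computed from model predictions with scikit-learn, as in the sketch below; the labels shown are synthetic placeholders rather than the actual test outputs.

```python
# Sketch: computing Table-4-style per-class precision/recall/F1 and a
# Figure-7-style confusion matrix with scikit-learn. The labels below are
# synthetic placeholders; in practice y_true is the test split and
# y_pred = model.predict(x_test).argmax(axis=1).
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

rng = np.random.default_rng(0)
y_true = rng.integers(0, 18, size=216)                 # 18 vowel classes
flip = rng.random(y_true.size) < 0.1                   # simulate ~10% errors
y_pred = np.where(flip, rng.integers(0, 18, y_true.size), y_true)

print(classification_report(y_true, y_pred, digits=2))  # precision/recall/F1
print(confusion_matrix(y_true, y_pred))                 # rows: true, cols: predicted
```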
Based on the error analysis and the evaluation of the CNN, an accuracy of more than 95% was achieved. This optimized CNN model was implemented in the automatic computer-assisted pronunciation training (CAPT) system. In this experiment, Thai vowel sounds were classified by the CAPT system in real situations with unseen data. The vowels recognized by the system were compared with those perceived by a linguist and a native speaker. The unseen dataset was collected from 4 users (2 males and 2 females), all 16 to 30 years old. Each user pronounced each of the 18 vowels 3 times, so the total unseen dataset comprised 216 sound files (18 vowels × 4 users × 3 times). On this unseen data, 22 of the system's recognized vowel sounds (10.19%) did not match the vowel sounds perceived by a linguist and a native speaker, as presented in Table 5. A total of 194 vowel sounds (89.81%) matched the linguist's and the native speaker's perceptions.
Table 6 shows the vowel pairings most often predicted by the system that do not match the perception of a linguist or a native speaker: ‘แอ’ /ɛ:/ and ‘ออ’ /ɔ:/, ‘เอ’ /e:/ and ‘เออ’ /ɤ:/, ‘อี’ /i:/ and ‘เอ’ /e:/, and ‘อือ’ /ɯ:/ and ‘ออ’ /ɔ:/. Each pair was mismatched twice. Linguistic theory explains these mismatches by the fact that the mispronounced pairs involve similar tongue positions, both front–back and high–low.
6. Conclusions
Vowels are the core (nucleus) of syllables and an important part of speech. Vowels are produced in the oral cavity depending on the tongue's position. Thai vowel pronunciation is difficult for nonnative speakers to master by themselves; experts are required to provide advice. Today, however, there is often a shortage of instructional specialists. To solve these problems, technology for pronunciation practice should be implemented. This research presents the appropriate acoustic features and an optimal CNN model for noisy Thai vowel speech recognition, applied in an automatic CAPT system.
The CAPT system is developed for daily learning activities, so pronunciation can be practiced anywhere and anytime. Therefore, the noisy Thai vowel dataset is collected from native speakers in real-world environments with variations in dimensions such as gender, age, accent, environment, and noise. The dataset, moreover, is designed, collected, and verified by a linguist based on linguistic theory. The 2D-CNN model combined with MS acoustic features improves performance in Thai vowel speech recognition: by employing various strategies and hyperparameter tuning, the model achieves a significant increase in accuracy over the baseline model, reaching 98.61%. Finally, the model is implemented in the CAPT system in a realistic situation, where the input data received from learners are unseen data. The vowels recognized by the CAPT system are compared with the vowel sounds perceived by a linguist and a native speaker, achieving an accuracy of 89.81%. Extracting vowel acoustic features with MS combined with the CNN provides distinctive acoustic features for Thai vowel speech recognition. This model can distinguish vowel sounds even though the data contain various noises, ages, accents, environments, and physical characteristics (i.e., female vs. male voices).
The automatic CAPT system uses the optimal CNN model combined with appropriate MS acoustic features for Thai vowel speech recognition. It can solve problems such as the lack of expertise, complexity, time consumption, and lack of real-time feedback. The contribution of this work is that its findings benefit stakeholders interested in developing assistive Thai vowel recognition systems or similar pronunciation systems, and it enables researchers to produce learning applications by following similar operations. Moreover, it can help guide and assist voice-impaired, nonnative, or nonstandard Thai dialect learners, allowing them to practice vowel pronunciation in real time, anywhere, and anytime, comparable with having an expert, Thai teacher, or linguist advise on pronunciation at all times. For future work, the authors will focus on novel methods of Thai speech recognition for other aspects of Thai words to improve the CAPT system.
7. Discussion
This research uses AI with a deep learning model to automate computer-assisted pronunciation training (CAPT) for Thai vowels. The recognition model for the 18 standard Thai vowels is an essential function of automatic speech recognition. The model is applied to recognize the vowel sounds pronounced by learners. For effective recognition, proper data and a proper structure are required to train the model.
The existing Thai audio corpus cannot be used for the objectives of this research. Therefore, to obtain theoretically qualitative vowel data grounded in linguistic principles, a new dataset was designed, collected, and verified based on linguistic theory by a linguist. The dataset was gathered from native speakers in real-world scenarios so that the CAPT system may be used anywhere and at any time. The recordings contain background noise at 30–40 dB SNR, such as road vehicles, people talking, wind in the park, and animal sounds; the available data are therefore categorized as the “noisy Thai vowels” dataset. The sounds were recorded with mobile phones by 50 Thai native speakers, yielding 1800 sound files in total, divided into 18 classes.
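As an aside on how a recording's noise level can be characterized, the sketch below estimates SNR using the standard 10·log10 power ratio from a leading noise-only segment; the assumption that each file begins with roughly 0.2 s of noise before the vowel onset is purely illustrative and not part of the paper's procedure.

```python
# Sketch: estimating the SNR (in dB) used to characterize the "noisy Thai
# vowels" dataset. Assumes a leading noise-only segment before the speech;
# the 0.2 s split point is an illustrative assumption.
import numpy as np
import librosa

def estimate_snr_db(wav_path, noise_seconds=0.2):
    y, sr = librosa.load(wav_path, sr=None)
    split = int(noise_seconds * sr)
    noise_power = np.mean(y[:split] ** 2)       # power of the noise segment
    signal_power = np.mean(y[split:] ** 2)      # power of the speech segment
    return 10 * np.log10(signal_power / noise_power)

# Files in the 30-40 dB range match the dataset's reported noise profile.
```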
The CNN model in this research can reduce the frequency variation of the acoustic features and extract appropriate features. This research preserves the sizes of the feature maps with padding strategies as in [39]. The CNN model reduces the spectral variability in the input features with max pooling as done in [38] and decreases overfitting with dropout strategies as in [58]. The ELU adopted in the CNN model provides outstanding results, corresponding to those reported in [51,59]. In [50], ELU was applied to each LFLB, and CNN-LSTM networks were constructed to learn local and global emotion-related features from speech and log-Mel spectrograms; the 2D CNN-LSTM network using the ELU activation function achieved recognition accuracies of 95.33% and 95.89% on Berlin EmoDB in speaker-dependent and speaker-independent experiments, respectively. The ELU facilitates faster CNN learning and leads to higher classification accuracies. The benefits of these strategies improve the performance of the Thai vowel CNN model.
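A minimal sketch of how these four strategies (same padding, max pooling, dropout, and ELU) combine in one convolutional block is given below; the filter counts, dropout rate, and block depth are illustrative assumptions, not the paper's published configuration.

```python
# One convolutional block combining the strategies discussed above:
# 'same' padding preserves the feature-map size [39], max pooling reduces
# spectral variability [38], dropout curbs overfitting [58], and ELU
# activation speeds up learning [51,59]. Filter counts and the dropout
# rate are illustrative assumptions.
from tensorflow.keras import layers, models

def conv_block(x, filters, drop_rate=0.25):
    x = layers.Conv2D(filters, (3, 3), padding="same")(x)  # preserve size
    x = layers.ELU()(x)                                    # ELU activation
    x = layers.MaxPooling2D((2, 2))(x)                     # reduce variability
    x = layers.Dropout(drop_rate)(x)                       # reduce overfitting
    return x

inputs = layers.Input(shape=(11, 128, 1))  # MS input: time x frequency x channel
x = conv_block(inputs, 32)
x = conv_block(x, 64)
x = layers.Flatten()(x)
outputs = layers.Dense(18, activation="softmax")(x)  # 18 Thai vowel classes
model = models.Model(inputs, outputs)
```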
Given the acoustic features of the model, the time and frequency extension techniques in [39,43] can be used for this CNN model. The results indicate that the optimal MS acoustic feature size is 11 × 128 (time × frequencies). The results of this study differ from those in [43], which applied the CNN model with MFCC acoustic features to two datasets (female and male). The suitable acoustic feature sizes for the two datasets were different, 11 × 40 and 11 × 64, respectively, and the best accuracies of that CNN model were 90.00% for female voices and 88.89% for male voices. Another difference is that in this study, the dataset combines female and male voices. This study also differs from [49], which used MFCC acoustic features with a CNN and a Laplace HSMM for neonatal bowel sounds. There, a sequence of 24 MFCCs was fed into a 1D CNN model with four convolutional layers to separate two classes, peristalsis and no peristalsis, achieving an accuracy of 89.81% and an AUC of 83.96%. In those works, MFCC used a logarithmic frequency scale and the DCT, whereas in this study, MS uses a linear frequency scale. Therefore, different types of sounds and acoustic features lead to different acoustic feature sizes in terms of frequencies and different model structures.
According to the results, the optimized CNN model with the Mel spectrogram provides outstanding performance with 98.61% accuracy, higher than the MFCC with the baseline LSTM model and the MS with the baseline LSTM model, which achieve accuracies of 94.44% and 90.00%, respectively. LSTM layers are beneficial for learning long-term contextual dependencies from long sequences; in contrast, the LSTM is used here on the Thai vowel dataset, which consists of one-syllable words and not long sentences, so the LSTM is not effective for this task. MS is used in conjunction with the CNN, and it can distinguish vowel sounds even though the aggregate dataset is more complex due to its many dimensions, such as various noises, ages, accents, environments, and physical characteristics (i.e., female vs. male voices). Similarly, in [46], MS was applied to the speech command recognition (SCR) task and achieved good performance: MS images with a feature size of 125 × 80 × 1 were used as acoustic features, and the light interior search network (LIS-Net) model applied to the Google Speech Command dataset achieved 97% accuracy.
The vowel speech dataset used in this research is similar to those used in [44,45], which employed a vowel speech dataset for classification with a CNN; their classification results were 94% and 99.6% accurate, respectively. This research differs from [44,45], which focus on Javanese vowel sounds. The dataset in [44,45] did not describe how the data were designed, collected, and verified under a linguistic theory; it was recorded from only one speaker, and only 250 Javanese middle vowel sound files were collected, divided into 5 classes (/e/, /ɛ/, /ə/, /o/, and /ɔ/). The dataset in this research has more diverse dimensions than [44,45] in the number of speakers, the total number of audio files, the variety of classes, and the diversity of sound environments, and the recognition model also provides satisfactory results. Vowel pronunciation was also applied to classical Arabic phonemes [52], which again differs from this research: in that study, the data consisted of 28 consonants associated with 3 short vowels, and a CNN was used to categorize 84 classes. A total of 6229 recorded items in the dataset were documented online from 85 speakers (81 native and 4 nonnative Arabic speakers), and the model obtained an accuracy of 95.77%. Thai vowel recognition using deep learning is compared with other vowel recognition tasks using deep learning in Table 7.
In many works, models are not deployed in real-world situations; as a result, when they are applied to systems, the true impacts cannot be determined. To examine robustness, the CNN model is implemented in the CAPT system in a real situation, where the input data received from learners are unseen data. When compared with the linguist's and the native speaker's perceptions of the vowel sounds, the CAPT system's vowel recognition was 89.81% accurate. This indicates that using a variety of data dimensions and carefully designing, collecting, and verifying the data are very helpful for creating high-quality input data for speech recognition models.
The automatic CAPT uses the CNN model for Thai vowel speech recognition. It solves problems such as the lack of expertise, complexity, time-consuming processes, and the absence of real-time feedback. Deep learning is applied to an ASR model to recognize the learner's pronunciation. The system takes raw speech as input and uses MS acoustic features along with the CNN to extract the distinctive features of the vowels and classify them. The vowel derived from the classification is then compared with the vowel selected by the learner; if they match, the learner's pronunciation is correct.
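A hedged end-to-end sketch of this feedback loop follows; the helper function, label list, and return messages are hypothetical glue code for illustration, not the system's actual implementation, and the feature extraction reuses the librosa sketch from the introduction.

```python
# Hypothetical glue code for the CAPT feedback loop described above:
# extract MS features from the learner's recording, classify with the
# trained CNN ('model'), and compare against the vowel the learner chose.
import numpy as np
import librosa

THAI_VOWELS = ["a", "a:", "i", "i:"]  # ... abbreviated; 18 labels in the full system

def ms_features(wav_path, n_frames=11, n_mels=128):
    y, sr = librosa.load(wav_path, sr=16000)
    ms = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels),
        ref=np.max).T                               # (frames, mel bands)
    pad = max(0, n_frames - ms.shape[0])
    return np.pad(ms, ((0, pad), (0, 0)))[:n_frames]

def assess_pronunciation(wav_path, selected_vowel, model):
    x = ms_features(wav_path)[np.newaxis, ..., np.newaxis]  # (1, 11, 128, 1)
    predicted = THAI_VOWELS[int(np.argmax(model.predict(x)[0]))]
    if predicted == selected_vowel:
        return "correct"
    # Incorrect: the system then shows corrective text, real video, and 3D video.
    return f"incorrect: heard {predicted}, expected {selected_vowel}"
```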
To increase the precision of pronunciation assessment, linguists use the acoustic phonetics method for phonetic analysis. In general, such research works require the hand-crafted extraction of Formant 1 (F1) and Formant 2 (F2), which presents significant challenges due to the large amount of data and speaker acoustic variation. This work differs from previous studies that used Praat with phonetic principles to analyze pronunciation [12,13,14,15,16,17]. In general practice, linguists measure F1 and F2 using Praat; the F1 and F2 of native and nonnative speakers are then plotted using Microsoft Excel, Python, or the R programming language, and the graphs are compared to find the differences between the native speakers and the learner, where differences mean the pronunciation is incorrect. These steps form a multistep rather than real-time approach, and they require someone with expertise in Praat or someone who can program in R or Python.
These traditional methods used by linguists may be complex for others, whereas this research provides a much simpler method that works in real time and solves those problems. Moreover, the technique can be reused to develop CNN models in similar domains by applying various strategies, optimizing the neural networks, and adjusting the hyperparameters for improved performance in future learning assistance systems.