Article

Non-Autoregressive End-to-End Neural Modeling for Automatic Pronunciation Error Detection

by Md. Anwar Hussen Wadud 1, Mohammed Alatiyyah 2,* and M. F. Mridha 3
1 Department of Computer Science and Engineering, Bangladesh University of Business and Technology, Dhaka 1216, Bangladesh
2 Department of Computer Science, College of Sciences and Humanities in Aflaj, Prince Sattam Bin Abdulaziz University, Al-Kharj 16278, Saudi Arabia
3 Department of Computer Science & Engineering, American International University of Bangladesh, Dhaka 1216, Bangladesh
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(1), 109; https://doi.org/10.3390/app13010109
Submission received: 27 November 2022 / Revised: 17 December 2022 / Accepted: 18 December 2022 / Published: 22 December 2022
(This article belongs to the Special Issue Deep Learning for Speech Processing)

Abstract

A crucial element of computer-assisted pronunciation training (CAPT) systems is the mispronunciation detection and diagnosis (MDD) technique. The provided transcriptions can act as a teacher when evaluating the pronunciation quality of the corresponding speech. Conventional approaches, such as forced alignment and extended recognition networks, have employed these prior texts for model development or for enhancing system performance. More recently, end-to-end (E2E)-based approaches have attempted to incorporate the prior texts into model training, and preliminary results indicate their efficacy. However, the practical applicability of attention-based end-to-end models is constrained by the multi-pass, left-to-right forward computation required for beam search. In addition, end-to-end neural approaches are typically data-hungry, and a lack of non-native training data frequently impairs their effectiveness in MDD. To address these problems, we present a novel MDD technique that uses non-autoregressive (NAR) end-to-end neural models to greatly reduce inference time while maintaining accuracy levels similar to those of traditional E2E neural models. In contrast to autoregressive models, NAR models accept parallel inputs and generate token sequences in parallel instead of relying on left-to-right forward computation. To further enhance the effectiveness of MDD, we design a pronunciation model stacked on top of the NAR end-to-end model. To test the effectiveness of our strategy against some of the best end-to-end models, we use the publicly accessible L2-ARCTIC and SpeechOcean English datasets for training and testing; the proposed model achieves better results than the other existing models.

1. Introduction

As the need for non-native language acquisition has increased, so has the demand for better performance from CAPT systems. Computer-assisted language learning (CALL) has attracted great attention over the past decade due to its versatility, which allows learners to enhance their skills at their own pace. CAPT, a sub-type of CALL, focuses on identifying faults in non-native learners’ speech. Mispronunciation detection and diagnosis (MDD) is one of the CAPT system’s core technologies and is comparable to a specialized form of automated phone recognition: MDD is accomplished [1] when the identified phones depart from the canonical productions elicited by the speakers’ responses to text-based cues.
The main component of a CAPT model is MDD, which recognizes pronunciation mistakes and gives corrective feedback to assist second-language (L2) learners [2]. Modern MDD pipelines are complex. Extended recognition networks (ERNs) were utilized in [3,4,5] to identify both standard phonemes and mispronunciations. The acoustic phonological model (APM) developed in [1] utilized a multi-distribution deep neural network (DNN) to forecast the most probable phonetic mispronunciations. To further enhance the performance of APM models, the articulatory acoustic-phonemic model (AAPM) [6] and the multi-task APM (MT-APM) [7] were developed.
Although these methods are viable, they have several drawbacks: (1) They frequently comprise several stages that are laborious to design. For example, it is challenging to construct ERNs [3,4,5] because we cannot enumerate all phonological rules, and the rules themselves are difficult to develop. (2) Because the components of these techniques are trained individually, errors may accumulate. (3) When constructing a new system, the intricacy of contemporary MDD pipelines necessitates substantial engineering work.
Thus, a unified end-to-end (E2E) MDD model has several benefits. First, this solution eliminates the requirement for phonological rules and forced alignments, which may need fragile design decisions. Second, it facilitates richer conditioning on traits such as the L1 (mother tongue) language, because conditioning may take place throughout the whole model rather than only on particular components. Consequently, adaptation with new information may also be simplified. Lastly, a single system is likely to be more resilient than a multi-stage model in which the faults of each component might accumulate.
Researchers have recently attempted to develop E2E methods for CAPT systems. Ref. [8] suggested an approach for E2E scoring that employs both acoustic and lexical data. However, their model’s output is a single score that does not include a mispronunciation diagnosis. CNN-RNN-CTC [9] is, to our knowledge, the first end-to-end model presented for MDD, and it is effective on the CU-CHLOE [10] corpus. However, not all of the available information is utilized: because learners are frequently exposed to sentences before reading, the linguistic elements of the sentences can be exploited to anticipate the phone sequence. In addition, its architecture is simplistic, and its performance does not improve upon that of conventional approaches.
Last but not least, it was solely tested on the non-public CU-CHLOE corpus spoken by Mandarin and Cantonese speakers. Motivated by the above limitations, we propose a system using NAR E2E neural models that can automatically identify and diagnose mispronunciations. Our proposed model combines a dictation model and a pronunciation model. There are different types of dictation models, such as Mask-CTC [11], CTC-ATT [12], LASO [13], and AlignRefine [14], which execute faster than traditional autoregressive E2E neural models. In our proposed model, we use LASO (listen attentively and spell once) as the dictation model. We place a pronunciation model on top of the dictation model in our proposed architecture to further enhance overall system performance. The overall contributions of the paper include the following:
  • We propose a non-autoregressive end-to-end neural network model to detect and diagnose pronunciation mistakes. The proposed automatic pronunciation model combines the dictation and pronunciation models.
  • We use an attention-based dictation model for automatic pronunciation mistake identification and a pronunciation model to diagnose the pronunciation mistakes.
  • We trained the model on two types of publicly available datasets and used different baseline and existing models to evaluate the proposed model.
Table 1 lists the abbreviations used throughout this study. The remainder of the article is organized as follows: Section 2 reviews the literature in the speech recognition research area; Section 3 describes the proposed architecture; Section 4 discusses the experimental setup and an analysis of the results; and Section 5 presents the conclusions and future work.

2. Literature Review

Various methods for tackling MDD have been presented during the past decade [15,16,17,18]. “Data sparsity” is a significant technical difficulty for these models. Specifically, MDD needs a granular diagnosis of pronunciation quality; hence, manual annotation at the phoneme level for L2 speech is frequently necessary for training these models [18]. Such annotation efforts are very labor-intensive and time-consuming [19,20]. In the extant research, there are two orthogonal approaches to addressing the sparsity of L2 data. One approach is to reduce the MDD model’s reliance on L2 training data by employing weakly supervised [21] or unsupervised [22] learning approaches. The other line of research focuses on producing new L2 training examples for data augmentation [23,24,25,26].
A naive method for augmenting L2 data is to automatically transcribe existing L2 speech using forced-alignment models to generate phoneme annotations [27]. This method, however, does not detect pronunciation problems and results in inaccurate annotations for non-native speech. Obtaining further L2 speech is also challenging, resulting in a relatively restricted corpus of annotated L2 data [28]. Random sampling of L2 phoneme substitutions [23] or sampling based on the probability distribution of phonemes [24] can be used to gather knowledge from real mispronunciation patterns. Phoneme-to-phoneme (P2P) models may also produce mispronounced speech by perturbing the phoneme sequence of the associated L1 speech using decision trees [29] or deep learning approaches [25,30]. More recently, Korzekwa et al. [26] developed text-to-speech (T2S) and speech-to-speech (S2S) models as two novel L2 data synthesis strategies. L2-GEN, a recently proposed data augmentation system, produces L2 phoneme sequences and related speech that capture mispronunciation patterns from real-world non-native ESL learners; it introduces a machine-learning-based L1-to-L2 phoneme paraphrasing model to generate a collection of L2 phoneme sequences given any L1 reference phonemes.
Witt et al. [31] coined the term “goodness of pronunciation” (GoP) in 2000. GoP begins by utilizing a forced-alignment approach to align the canonical phonemes with the speech signal; this step seeks to identify the most probable mapping between phonemes and regions of the voice stream. GoP then computes the ratio between the probabilities of the canonical and the most probable spoken phonemes. Finally, a mispronunciation is identified if the ratio falls below a certain threshold. Deep neural networks (DNNs) [32,33,34] were later added to GoP to replace the hidden Markov model (HMM) and Gaussian mixture model (GMM) approaches in acoustic modeling [1,15]. Cheng et al. [35] increased the performance of GoP using an unsupervised latent representation of the extracted speech. The paper [36] presents the detection and diagnosis of mispronunciation using a specifically tailored end-to-end ASR model that incorporates a hybrid CTC/attention method [37,38,39,40]. In addition, Ref. [36] employed an anti-phone collection to create additional voice training data using a label-shuffling strategy as a unique data enhancement operation: for each sentence in the original speech training dataset, the label of the phone at each position in its reference transcript is either left untouched or randomly replaced with an arbitrary anti-phone label.
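To make the GoP recipe above concrete, the following sketch (ours, not code from [31]) scores one aligned segment from frame-averaged phone posteriors; the posterior values, the phone labels, and the threshold of -1.0 are illustrative assumptions.

```python
import math

def goodness_of_pronunciation(post_probs, canonical_phone, threshold=-1.0):
    """GoP sketch: log of the ratio between the posterior of the canonical
    phone and that of the most likely phone over the aligned segment,
    flagged as a mispronunciation when it falls below a threshold."""
    best_phone = max(post_probs, key=post_probs.get)
    gop = math.log(post_probs[canonical_phone] / post_probs[best_phone])
    return gop, gop < threshold   # (score, is_mispronounced)

# Toy frame-averaged posteriors for one aligned segment (hypothetical numbers).
posteriors = {"ae": 0.15, "ah": 0.70, "aa": 0.15}
print(goodness_of_pronunciation(posteriors, canonical_phone="ae"))
```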
Unlabeled in-domain speech can also be leveraged to improve mispronunciation detection and diagnosis (MDD) accuracy. Pseudo-labeling (PL) has been a popular technique for semi-supervised ASR. Based on how the pseudo-labels are generated, methods may be categorized as either offline or online PL. Offline PL approaches assign pseudo-labels to unlabeled samples using a teacher model trained independently; a student model is then trained on both labeled and pseudo-labeled samples [41]. It has been demonstrated that filtering heuristics [42,43] and repeated training [44] can increase PL quality. In online PL approaches, however, pseudo-labels are created in real time by the online model itself [45,46,47]. In contrast to [45], we combine PL with wav2vec 2.0 fine-tuning and use phonemes as targets.
Although various mispronunciation detection models are discussed in this section, most use autoregressive neural modeling techniques. Autoregressive neural modeling generates the output with left-to-right forward computation, conditioning each step on the input and on the previous block’s output. This increases processing time and degrades performance when data are scarce. In this study, we use non-autoregressive neural modeling techniques to reduce the processing time and a pronunciation model to improve performance.

3. The Proposed Architecture

In this section, we first provide background on autoregressive (AR) models and non-autoregressive (NAR) models that is helpful for understanding the proposed model.

3.1. Autoregressive Models

The autoregressive model workflow is similar to that of backpropagation neural network models in that the input of each step is the previous step’s output; sometimes, the input of the next block depends on more than one previous block’s output. Because such models depend on previous outputs, they usually require a large processing time compared to non-autoregressive models. Figure 1 shows the workflow of the autoregressive model.

3.2. Non-Autoregressive Models

NAR models can take parallel inputs and generate token sequences in parallel. The processing time of such a model is low because it does not depend on previous outputs and can generate all outputs in parallel. The workflow of the non-autoregressive model is visualized in Figure 2.
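The following sketch illustrates the difference between the two decoding styles; `encoder` and `decoder` are hypothetical callables standing in for a generic sequence-to-sequence ASR model, not the exact modules of the proposed system.

```python
import torch

def autoregressive_decode(encoder, decoder, feats, sos_id, eos_id, max_len=50):
    """Left-to-right decoding: each step conditions on all previous outputs."""
    memory = encoder(feats)                        # (1, T, D)
    tokens = [sos_id]
    for _ in range(max_len):                       # sequential loop -> slow
        inp = torch.tensor(tokens).unsqueeze(0)    # (1, t)
        logits = decoder(inp, memory)              # (1, t, V)
        next_tok = int(logits[0, -1].argmax())
        if next_tok == eos_id:
            break
        tokens.append(next_tok)
    return tokens[1:]

def non_autoregressive_decode(encoder, decoder, feats, L, eos_id):
    """Single-pass decoding: all L positions are predicted in parallel."""
    memory = encoder(feats)                        # (1, T, D)
    positions = torch.arange(L).unsqueeze(0)       # fixed-length position queries
    logits = decoder(positions, memory)            # (1, L, V) in one forward pass
    tokens = logits.argmax(dim=-1).squeeze(0).tolist()
    return [t for t in tokens if t != eos_id]      # strip padding <eos> tokens
```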
We propose a novel model that combines a dictation model (DM) and a pronunciation model (PM). A non-autoregressive (NAR) end-to-end (E2E) neural dictation model is used to reduce execution time relative to traditional E2E neural models. To further enhance the performance of the proposed system, we place the pronunciation model on top of the dictation model in our proposed architecture. The workflow of the experimental studies is depicted, in consecutive order, in Figure 3. The procedure starts with input audio data and extracts FBANK and wav2vec acoustic characteristics; FBANK and wav2vec are feature engineering models that extract acoustic features from input signals. Then, the acoustic (dictation) model is trained on these acoustic features, and the language (pronunciation) model evaluates the dictation output.

3.3. Dictation Model (DM)

Various automatic speech recognition (ASR) [48] methods, such as Mask-CTC, CTC-ATT, LASO, and AlignRefine, can be used to build dictation models [49]. The dictation approach adopts non-autoregressive end-to-end modeling to shorten the execution time of end-to-end decoding. Transformer-based end-to-end techniques derived from ASR, such as LASO (listen attentively and spell once), have been successfully used to identify and correct pronunciation errors. In this study, we investigate a unique usage of LASO [13], instead of Mask-CTC, to develop the dictation model. The mechanism of the dictation method is shown in Figure 4.
The main principle of the dictation method is that the phoneme feature sequence comprises characteristics of both language semantics and pronunciation. Position-wise token estimation is possible if patterns for every token location are extracted from the entire auditory feature sequence. The dictation model has three modules: the encoder, the position-dependent summarizer (PDS), and the decoder. The core building block of the dictation model is the multi-head self-attention (MHSA) layer, and each module contains a different number of MHSA and feed-forward network layers. The encoder module captures the acoustic characteristics and long-term dependencies. The PDS reorganizes the acoustic sequence and summarizes the semantic relationship. The decoder module captures the sequence relationship and predicts tokens. The sequence relationship is captured by the feed-forward attention framework [50]. The feed-forward attention mechanism employs a weighted sum to combine the input sequence, in contrast to recurrent neural networks, which incorporate past context into latent vectors; it can be computed in parallel due to its feed-forward nature. In this study, we follow [23] and employ a position-wise feed-forward network with scaled dot-product attention as the fundamental sub-module. For fast training, we employ “pre-norm” [51]. Figure 5 shows the structure of the feed-forward attention block. Key (K), value (V), and query (Q) are the three inputs of an attention block. Some modules generate K and V as outputs, which are inputs to other modules, and some modules use attention blocks to create semantic relationships or dependencies between different tokens. The self-attention block calculates the scaled dot product using Equation (1):
$\mathrm{Atn}(Q, K, V) = \mathrm{Softmax}\!\left(\dfrac{QK^{T}}{\sqrt{D_k}}\right)V$ (1)
where $Q$ denotes the queries ($Q \in \mathbb{R}^{T_q \times D_k}$), $K$ denotes the keys ($K \in \mathbb{R}^{T_k \times D_k}$), and $V$ denotes the values ($V \in \mathbb{R}^{T_k \times D_v}$). $T_q$ and $T_k$ denote the sequence lengths of the query and the key, respectively, and $D_k$ and $D_v$ denote the key and value dimensionalities, where $D_v = D_k$.
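A minimal PyTorch sketch of Equation (1) is shown below; the tensor shapes are toy values chosen only to match the notation above.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Equation (1): Atn(Q, K, V) = Softmax(Q K^T / sqrt(D_k)) V.

    Q: (Tq, Dk), K: (Tk, Dk), V: (Tk, Dv) with Dv = Dk.
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (Tq, Tk)
    weights = torch.softmax(scores, dim=-1)             # attention weights
    return weights @ V                                   # (Tq, Dv)

# Toy shapes: 15 position queries attending over 40 acoustic frames.
Q = torch.randn(15, 64)
K = torch.randn(40, 64)
V = torch.randn(40, 64)
out = scaled_dot_product_attention(Q, K, V)   # (15, 64)
```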
For example, suppose the speech signal input to the encoder block is “Gad your letter came just in time”. The multi-head attention blocks in the encoder layer generate corresponding values from queries and keys using dot products. Suppose the generated keys (K) are [0.12, 0.342, 0.453, 0.45, 0.265, 0.432, 0.653, 0.525, 0.753, 0.844, 0.854, 0.964, 0.654, 0.432, 0.455, 0.786, 0.456] and the corresponding values (V) are [Ga, ad, Gadd, Gad, you, your, let, letter, come, came, jute, just, on, in, tick, time]. The encoder output keys and values are used as inputs to the PDS block. A query in the PDS block is an input vector with a predetermined length L. Suppose L is 15 and the queries (Q) are [0.45, 0.432, 0.525, 0.844, 0.964, 0.456, 0.674] for the above example. The PDS block uses the multilayer attention head to generate a summary of the encoder output according to the token positions and inserts <eos> if the length of the token sequence is less than L. The token sequence in the above example has eight tokens, because <sos>, a token indicating that the sequence has started, is added at the beginning of each sequence. Since there are eight tokens in total, <eos> is added seven times to make the sequence equal to the length L. Finally, the output generated from the PDS block will be [<sos>, Gad, your, letter, in, came, just, in, time, <eos>, <eos>, <eos>, <eos>, <eos>, <eos>, <eos>]. The decoder block then removes the <sos> at the beginning of the token sequence, calculates the sequence relationship, and further refines the PDS output. The decoder output for the example above is [Gad, your, letter, came, in, just, in, time, <eos>, <eos>, <eos>, <eos>, <eos>, <eos>, <eos>].

3.3.1. Encoder

Two-dimensional convolutional neural networks are used at the encoder level to capture the position of the phoneme sequences, and attention blocks capture long-term relationships between different tokens. Our initial step is to sub-sample the sequence of audio features using a two-layer convolutional neural network. Next, the outputs of the convolutional neural network are flattened into a $T \times D_m$ matrix, where $T$ is the sub-sampled feature sequence length and $D_m$ is the dimension. Localization of the acoustic feature sequence is another goal of the CNN. Furthermore, as seen in Figure 4, the encoder has a stack of $N_e$ attention blocks. It is a self-attention mechanism because all of the queries, keys, and values are the same; in other words, the attention scores are calculated by computing the dot product between each pair of input vectors, which identifies the long-term dependencies.
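The sub-sampling front-end can be sketched as follows; the layer count matches the two-layer description above, but the channel widths and the model dimension are assumed values, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConvSubsampler(nn.Module):
    """Two 2-D conv layers that sub-sample the FBANK sequence by 4x in time,
    then flatten to a (T', D_m) matrix, as described for the encoder front-end."""

    def __init__(self, in_freq=80, d_model=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(32 * ((in_freq + 3) // 4), d_model)

    def forward(self, feats):                 # feats: (B, T, 80)
        x = feats.unsqueeze(1)                # (B, 1, T, 80)
        x = self.conv(x)                      # (B, 32, ~T/4, 20)
        b, c, t, f = x.shape
        x = x.transpose(1, 2).reshape(b, t, c * f)
        return self.proj(x)                   # (B, ~T/4, d_model)
```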

3.3.2. Position-Dependent Summarizer (PDS)

The attention output is then given to the position-dependent summarizer (PDS) block as the key and value. The PDS block generates a summary of the encoder output and permutes it according to the token positions. The position-encoding vector of the various tokens is used as the query in the first attention block of the PDS module, and the maximum length of the position-encoding vector is L. In order to fetch the high-level tokens from the encoder, the PDS module makes use of queries that depend on the locations in the token sequence. We refer to the module as a “summarizer” because it can be observed to “summarize” the acoustic properties. As a result, it bridges the gap between the length of the token sequence and the length of the speech signal sequence. In the PDS, there are N attention blocks, where the position encoding is used as the queries for the first block and the outputs of the preceding block are used as the queries of the next block, as shown in Figure 4. Each query represents a token position in the token sequence, and all of the encoder’s outputs serve as the keys and the values.
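A sketch of one PDS-style attention block is given below, using PyTorch's built-in multi-head attention; the fixed length L, model width, and head count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PositionDependentSummarizer(nn.Module):
    """One PDS attention block: learned position encodings of fixed length L
    act as queries; the encoder outputs supply the keys and values."""

    def __init__(self, d_model=256, n_heads=8, max_len=15):
        super().__init__()
        self.pos_queries = nn.Parameter(torch.randn(max_len, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, enc_out):               # enc_out: (B, T, d_model)
        b = enc_out.size(0)
        q = self.pos_queries.unsqueeze(0).expand(b, -1, -1)   # (B, L, d_model)
        summary, _ = self.attn(q, enc_out, enc_out)           # keys = values = encoder outputs
        return summary                         # (B, L, d_model), one vector per token position
```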

3.3.3. Decoder

The decoder module calculates how the focus tokens interact with other specific tokens and further refines the PDS module representations. It is a self-attention module made up of $N_d$ attention blocks, very similar to the encoder. It captures the sequence relationships, or implicit language semantics, in the representations produced by the PDS. The PDS’s outputs are the decoder’s inputs, and the decoder makes use of the entire utterance’s context.

3.4. Pronunciation Model (PM)

English learners can produce a broad range of utterances, typically affected by their mother tongue. Furthermore, because labeled dialogue corpora from English learners are scarce for model assessment, more than a one-size-fits-all dictation model is needed to avoid overfitting problems. Therefore, to reduce the overfitting problem, we use an additional pronunciation model, which is expected to help determine whether each dictated phone is a valid pronunciation. Figure 6 shows the step-by-step architecture of our proposed pronunciation model. This model verifies the acoustic sequences generated by the dictation model against the annotated phonetic sequences. In the PM, the annotated pronunciation sequence is fed into the sentence embedding to generate the feature vectors of the input sequence. The vectors are then fed into the sentence encoder, which combines bidirectional long short-term memory (Bi-LSTM) and linear layers. The bidirectional recurrent neural network layer output serves as the value, and the linear layer output serves as the key. The attention block takes this key and value from the sentence encoder and the query from the dictation model, and generates a context vector containing the monotonic alignment of the annotated and the dictated phoneme sequences. Finally, the MDD decision is produced using a multi-layer feed-forward neural network followed by linear and softmax layers.
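The following sketch mirrors this description (embedding, Bi-LSTM sentence encoder, key/value projections, attention queried by the dictation model, and a classifier); all layer sizes, the shared value projection, and the two-class output are assumptions rather than the reported configuration.

```python
import torch
import torch.nn as nn

class PronunciationModel(nn.Module):
    """Minimal PM sketch: encode the annotated phone sequence with a Bi-LSTM
    plus a linear layer, attend over it with queries from the dictation model,
    and classify each position as correctly pronounced or mispronounced."""

    def __init__(self, n_phones, d_embed=512, d_hidden=384, d_model=256, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(n_phones, d_embed)
        self.bilstm = nn.LSTM(d_embed, d_hidden, bidirectional=True, batch_first=True)
        self.key_proj = nn.Linear(2 * d_hidden, d_model)     # keys from the linear layer
        self.val_proj = nn.Linear(2 * d_hidden, d_model)     # values from the Bi-LSTM output
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, n_classes),                   # softmax applied in the loss
        )

    def forward(self, canonical_phones, dictation_queries):
        # canonical_phones: (B, S) phone ids; dictation_queries: (B, L, d_model)
        enc, _ = self.bilstm(self.embed(canonical_phones))   # (B, S, 2*d_hidden)
        k, v = self.key_proj(enc), self.val_proj(enc)
        ctx, _ = self.attn(dictation_queries, k, v)          # alignment context vector
        return self.classifier(ctx)                          # (B, L, n_classes)
```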

3.5. Learning and Inference Procedure of the Proposed Model

We use Equation (2) to minimize the position-wise cross-entropy loss and optimize the parameters of the model:
$CE(\theta) = -\dfrac{1}{NL}\sum_{j=1}^{N}\sum_{i=1}^{L}\log P\!\left(y_i^{j} \mid x^{j};\theta\right)$ (2)
where $x^{j}$ and $y^{j}$ are the $j$-th input–text pair and $N$ is the total number of pairs in the corpus. $L$ is the predefined length of the token sequence; if the text length is less than $L$, then <eos> is used to pad the remaining positions. Moreover, $\theta$ denotes the trainable parameters of the model.
We select only the most probable token at each position using Equation (3); the inserted <eos> tokens are then removed from the tail of the text sequence:
$\bar{y}_i = \operatorname*{arg\,max}_{y_i} P(y_i \mid x), \quad i = 1, \ldots, L$ (3)
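A compact sketch of the training objective in Equation (2) and the position-wise decoding rule in Equation (3) might look as follows; the (N, L, V) tensor layout and the `eos_id` handling are assumptions about how the targets are padded.

```python
import torch
import torch.nn.functional as F

def position_wise_ce_loss(logits, targets):
    """Equation (2): cross-entropy averaged over all N*L token positions,
    with <eos> used to pad every target sequence to the fixed length L.
    logits: (N, L, V); targets: (N, L) already padded with the <eos> id."""
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))

def greedy_decode(logits, eos_id):
    """Equation (3): take the most probable token at every position, then
    drop the <eos> padding from the tail of each sequence."""
    tokens = logits.argmax(dim=-1)                     # (N, L)
    hyps = []
    for seq in tokens.tolist():
        while seq and seq[-1] == eos_id:
            seq = seq[:-1]
        hyps.append(seq)
    return hyps
```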

4. Experimental Setup

In this section, we describe the experimental setup and the datasets that were utilized. The experiments were run using the Python programming language and the PyTorch deep learning framework.

4.1. Dataset Description

We used three publicly available datasets: TIMIT [52], L2-ARCTIC [27], and SpeechOcean [28]. A detailed summary of the three datasets is shown in Table 2. TIMIT is an older dataset created for ASR purposes, while the L2-ARCTIC corpus contains recordings of non-native speakers reading ARCTIC prompts. The TIMIT [52] audio dataset was developed to provide material for the extraction of acoustic features and the development and evaluation of ASR systems. The TIMIT corpus has 630 speakers covering 6300 sentences, with an average of 10 sentences per speaker. The TIMIT corpus contains three types of sentences: SA (dialect), SX (compact), and SI (diverse). SA has only two sentences, SX has 450 sentences, and SI has 1890 sentences. The L2-ARCTIC [27] dataset is a publicly available non-native English dataset for mispronunciation research. It comprises 26,867 utterances from 24 male and female speakers and has 3599 annotations. All speakers are non-native speakers whose primary languages are Vietnamese, Spanish, Korean, Hindi, Chinese, and Arabic. SpeechOcean762 [28] is a recently released publicly available dataset for evaluating English pronunciation; it contains 5000 utterances by 250 speakers whose primary language is Mandarin.

4.2. Experimental Setup

The proposed model used 80-dimensional Mel-filter-bank feature vectors (FBANKs) for feature extraction, computed over 25 ms frames every 10 ms. Additionally, we used wav2vec 2.0 [53] as an additional front-end feature extractor, which has shown promising results in enhancing the performance of several ASR tasks, particularly when training data are scarce. The proposed architecture was trained for 100 epochs. We used six attention layers for both the encoder and the decoder, with eight heads in each of the self-attention blocks of the dictation model. In the pronunciation model, the sentence encoder takes inputs of size 512 and feeds them to a Bi-LSTM with a hidden size of 384. The proposed model used a cross-entropy loss function for training with a dropout value of 0.2. We used a single Nvidia Tesla V100 GPU to train both the dictation model and the pronunciation model.
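For reference, the 80-dimensional FBANK front-end described above can be computed with torchaudio roughly as follows; the file name and the 16 kHz mono waveform are illustrative assumptions.

```python
import torchaudio

# Sketch of the FBANK front-end: 80 Mel bins, 25 ms frames, 10 ms shift.
waveform, sample_rate = torchaudio.load("utterance.wav")   # hypothetical file
fbank = torchaudio.compliance.kaldi.fbank(
    waveform,
    num_mel_bins=80,      # 80-dimensional Mel-filter-bank features
    frame_length=25.0,    # 25 ms analysis window
    frame_shift=10.0,     # 10 ms hop
    sample_frequency=sample_rate,
)
print(fbank.shape)        # (num_frames, 80)
```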

4.3. Performance Evaluation Metrics

In order to identify correct pronunciation and mispronunciation, this study used common assessment measures such as accuracy, precision (PR), recall (RE), and F1-score (F1).
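A minimal sketch of how these detection metrics can be computed from aligned per-phone labels is given below, treating a mispronounced phone as the positive class and assuming the phone-level alignment has been done upstream.

```python
def detection_metrics(ref_labels, hyp_labels):
    """Accuracy, precision, recall, and F1 for mispronunciation detection.
    Inputs are aligned per-phone 0/1 labels (1 = mispronounced)."""
    tp = sum(1 for r, h in zip(ref_labels, hyp_labels) if r == 1 and h == 1)
    fp = sum(1 for r, h in zip(ref_labels, hyp_labels) if r == 0 and h == 1)
    fn = sum(1 for r, h in zip(ref_labels, hyp_labels) if r == 1 and h == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for r, h in zip(ref_labels, hyp_labels) if r == h) / len(ref_labels)
    return accuracy, precision, recall, f1
```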

4.4. Experimental Results

We used the L2-ARCTIC and SpeechOcean762 datasets to train and test the proposed model, and we additionally used the TIMIT dataset for training the mispronunciation detection and diagnosis system. Two feature extractors, FBANK and wav2vec 2.0, extracted features from the speech in the L2-ARCTIC and SpeechOcean datasets. Figure 7 shows the training and validation curves of the three most-used metrics (adjusted Rand index (ARI), normalized mutual information (NMI), and accuracy) on the L2-ARCTIC and SpeechOcean datasets using the FBANK feature extraction model. For the L2-ARCTIC dataset, the proposed model learns smoothly from epoch 0 to 100, as shown in Figure 7a, but the training and validation curves in Figure 7b for the SpeechOcean dataset are not smooth due to misrecognition of both vowels and consonants in the training phase. However, the accuracy with the SpeechOcean dataset is higher than with the L2-ARCTIC dataset, because SpeechOcean is a balanced dataset, whereas L2-ARCTIC is an unbalanced dataset.
The adjusted Rand index (ARI) metric calculates the similarity between the generated utterance sequence and the annotation sequence. The ARI score is better for the L2-ARCTIC dataset with both the FBANK and wav2vec feature extraction models, as shown in Figure 7a, Figure 8a, and Figure 9a. The normalized mutual information (NMI) metric measures the dependence between different tokens in a voice signal; a better NMI score indicates that the order of utterances is strongly connected to the different tokens. Figure 7, Figure 8 and Figure 9 show the NMI curves of the different datasets for both the training and validation steps over 100 epochs, where the highest NMI for both training and validation is achieved on the SpeechOcean dataset with the wav2vec 2.0 (large) feature extraction model.
We selected an utterance from the L2-ARCTIC [27] dataset to visualize the different heads of the last attention block in the encoder, PDS, and decoder blocks. Each attention layer consists of eight attention heads, and we show the first four heads of the last attention block to aid understanding of the model. Figure 10 shows the output of different heads at the last attention level of the encoder block, where different heads produce different attention patterns for the same utterance. Figure 10a,c show the visualization of the speech sequence with respect to different times, whereas Figure 10b,d show the relationship between different speech sequences as acoustic frames. From these figures, we observe that most attention heads learn different patterns to calculate the final attention score. The PDS block summarizes the encoder outputs and reshuffles the encoder representation according to the token positions. The PDS module uses the token sequence as the query, and the output of the encoder representation is used as the key and value. Figure 11 shows the relationship between the different tokens and the different time frames.
Figure 11 shows that the PDS pads the sequence with <eos> to reach the specified token-sequence length L. The different attention heads of the PDS generate different token representations, as visualized in Figure 11, where the y-axis illustrates the token sequence and the x-axis illustrates the time frame. The PDS reconstructs all tokens according to the token positions in different time frames. The mean of the eight attention heads is the final score of an attention block. We clearly observe that the encoder’s outputs and the relevant tokens are aligned from the top to the lower right. The alignment for the padding token <eos> is ambiguous, because there is no definite link between the padding token and the encoder’s outputs. The overall alignment is not very precise because the various heads in the multi-head attention have different alignments. The decoder self-attention layer calculates the interaction of the focus words with other specific words, as shown in Figure 12. The words in the source sequence are analyzed by the decoder, which utilizes them to forecast the words in the target pattern. Figure 12 visualizes the attention scores of different heads of the last attention layer in the decoder module. In Figure 12a, the first attention head represents the sequence relationship of the various tokens. However, some attention heads hide the representation of previous token positions, as shown in Figure 12b,c, while others hide the subsequent token representations, as shown in Figure 12d. In the dictation model, Equation (2) is used to minimize the position-wise cross-entropy loss, and Equation (3) is ultimately used to select the most probable token at each position and remove the <eos> tokens from the tail.
We then used the edit distance algorithm to calculate the phoneme error rate (PER) between the generated and annotated phoneme sequences. Table 3 displays the phone error rate (%) of the different baseline models and the proposed model on the L2-ARCTIC dataset. The highest PER, 27.75%, is produced by the CNN-RNN-CTC baseline model. Our proposed model produces a PER of 20.3% with the FBANK feature extraction method and improves with the W2V-based feature extraction model, reaching a level similar to that of the CTC-ATT baseline with a beam size of 10. The proposed model significantly improves the phone error rate with wav2vec 2.0 large, reaching 14.1%. Repeated misrecognition of vowels and consonants is the major cause of increased phoneme error rates.
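The PER computation used above can be sketched as follows; the phone strings in the toy example are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between reference and hypothesis phone sequences."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(dp[j] + 1,            # deletion
                      dp[j - 1] + 1,        # insertion
                      prev + (r != h))      # substitution (or match)
            prev, dp[j] = dp[j], cur
    return dp[-1]

def phoneme_error_rate(ref, hyp):
    """PER = edit distance / reference length, as reported in Table 3."""
    return edit_distance(ref, hyp) / len(ref)

# Toy example: one substitution out of three reference phones -> PER = 1/3.
print(phoneme_error_rate(["g", "ae", "d"], ["g", "ah", "d"]))
```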
Table 4 shows the frequency of incorrectly recognized vowels and consonants during the pronunciation of different words. The total number of misrecognized vowels and consonants is shown as the number of targets. Our model identifies a larger number of the misrecognized phones than the existing CNN-RNN-CTC model. The most misrecognized vowel is ah, for which the CNN-RNN-CTC baseline model detects 2035 instances and the proposed model detects 2460 instances in the complete dataset, as shown in Table 4. Similarly, s and t are the most frequently misrecognized consonants.
We compared the results of different existing methods with the proposed method, as shown in Table 5. The table shows results for both existing and baseline mispronunciation detection and diagnosis models built on the L2-ARCTIC dataset. The Mask-CTC model is analyzed with and without the pronunciation model; with the pronunciation model, Mask-CTC shows a better F1-score for mispronunciation detection. Our proposed method has two sub-models: the dictation model and the pronunciation model. First, we ran the dictation model on the different datasets; then, we ran the full proposed model and analyzed the results. In every setting, the proposed model performs best on both correct pronunciation detection (CD) and mispronunciation detection (MD). Finally, we compared the performance of our proposed model with the GOPT [54] model on the SpeechOcean dataset. The GOPT model achieved a pronunciation score accuracy of 71.4%, whereas our model outperformed the existing models with a correct pronunciation accuracy of 90%.

5. Conclusions and Future Work

In this work, we proposed a non-autoregressive pronunciation mistake identification model that combines a dictation model and a pronunciation model. The proposed architecture used the LASO ASR approach for the dictation model and a Bi-LSTM-based approach for the pronunciation model. We used the LASO approach to create a token sequence without explicit language modeling, because token relationships are implicit in voice signals; on this basis, LASO generates tokens in a single forward pass without beam searching. Additionally, the feed-forward structure makes it possible to execute parallel computation effectively and drastically lower the time cost of inference, which verifies the practical viability of the NAR E2E strategy for various prospective CAPT applications. In future studies, we intend to investigate more precise suprasegmental, prosodic, accentual, and acoustic pronunciation phenomena for NAR E2E methods for pronunciation error identification and diagnosis.

Author Contributions

Supervision, M.F.M.; conceptualization, M.A.H.W. and M.F.M.; methodology; data curation, M.F.M. and M.A.H.W.; software, M.A.H.W.; validation, M.F.M. and M.A.; formal analysis, M.A. and M.F.M.; investigation, M.A.; resources, M.F.M. and M.A.H.W.; writing—original draft preparation, M.A.H.W. and M.A.; writing—review and editing, M.A., M.A.H.W.; visualization M.A.H.W.; project administration, M.F.M. All authors have read and agreed to the published version of the manuscript.

Funding

This project was supported by the Deanship of Scientific Research at Prince Sattam Bin Abdulaziz University under the research project #(PSAU-2022/01/19401).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, K.; Qian, X.; Meng, H. Mispronunciation detection and diagnosis in l2 english speech using multidistribution deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 25, 193–207. [Google Scholar] [CrossRef] [Green Version]
  2. Agarwal, C.; Chakraborty, P. A review of tools and techniques for computer aided pronunciation training (CAPT) in English. Educ. Inf. Technol. 2019, 24, 3731–3743. [Google Scholar] [CrossRef]
  3. Lo, W.K.; Zhang, S.; Meng, H. Automatic derivation of phonological rules for mispronunciation detection in a computer-assisted pronunciation training system. In Proceedings of the Eleventh Annual Conference of the International Speech Communication Association, Makuhari, Japan, 26–30 September 2010. [Google Scholar]
  4. Harrison, A.M.; Lo, W.K.; Qian, X.J.; Meng, H. Implementation of an extended recognition network for mispronunciation detection and diagnosis in computer-assisted pronunciation training. In Proceedings of the International Workshop on Speech and Language Technology in Education, Warwickshire, UK, 3–5 September 2009. [Google Scholar]
  5. Qian, X.; Soong, F.K.; Meng, H. Discriminative acoustic model for improving mispronunciation detection and diagnosis in computer-aided pronunciation training (CAPT). In Proceedings of the Eleventh Annual Conference of the International Speech Communication Association, Makuhari, Japan, 26–30 September 2010. [Google Scholar]
  6. Mao, S.; Wu, Z.; Li, X.; Li, R.; Wu, X.; Meng, H. Integrating articulatory features into acoustic-phonemic model for mispronunciation detection and diagnosis in l2 english speech. In Proceedings of the 2018 IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, USA, 23–27 July 2018; pp. 1–6. [Google Scholar]
  7. Mao, S.; Wu, Z.; Li, R.; Li, X.; Meng, H.; Cai, L. Applying multitask learning to acoustic-phonemic model for mispronunciation detection and diagnosis in l2 english speech. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 6254–6258. [Google Scholar]
  8. Chen, L.; Tao, J.; Ghaffarzadegan, S.; Qian, Y. End-to-end neural network based automated speech scoring. In Proceedings of the 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 6234–6238. [Google Scholar]
  9. Leung, W.K.; Liu, X.; Meng, H. CNN-RNN-CTC based end-to-end mispronunciation detection and diagnosis. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 8132–8136. [Google Scholar]
  10. Meng, H.; Lo, Y.Y.; Wang, L.; Lau, W.Y. Deriving salient learners’ mispronunciations from cross-language phonological comparisons. In Proceedings of the 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), Kyoto, Japan, 9–13 December 2007; pp. 437–442. [Google Scholar]
  11. Higuchi, Y.; Watanabe, S.; Chen, N.; Ogawa, T.; Kobayashi, T. Mask CTC: Non-autoregressive end-to-end ASR with CTC and mask predict. arXiv 2020, arXiv:2005.08700. [Google Scholar]
  12. Kim, S.; Hori, T.; Watanabe, S. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 4835–4839. [Google Scholar]
  13. Bai, Y.; Yi, J.; Tao, J.; Tian, Z.; Wen, Z.; Zhang, S. Listen attentively, and spell once: Whole sentence generation via a non-autoregressive architecture for low-latency speech recognition. arXiv 2020, arXiv:2005.04862. [Google Scholar]
  14. Chi, E.A.; Salazar, J.; Kirchhoff, K. Align-refine: Non-autoregressive speech recognition via iterative realignment. arXiv 2020, arXiv:2010.14233. [Google Scholar]
  15. Sudhakara, S.; Ramanathi, M.K.; Yarra, C.; Ghosh, P.K. An Improved Goodness of Pronunciation (GoP) Measure for Pronunciation Evaluation with DNN-HMM System Considering HMM Transition Probabilities. In Proceedings of the INTERSPEECH, Graz, Austria, 15–19 September 2019; pp. 954–958. [Google Scholar]
  16. Wu, M.; Li, K.; Leung, W.K.; Meng, H. Transformer Based End-to-End Mispronunciation Detection and Diagnosis. In Proceedings of the Interspeech, Brno, Czech Republic, 30 August–3 September 2021; pp. 3954–3958. [Google Scholar]
  17. Yan, B.C.; Wu, M.C.; Hung, H.T.; Chen, B. An end-to-end mispronunciation detection system for L2 English speech leveraging novel anti-phone modeling. arXiv 2020, arXiv:2005.11950. [Google Scholar]
  18. Yan, B.C.; Chen, B. End-to-end mispronunciation detection and diagnosis from raw waveforms. In Proceedings of the 2021 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 23–27 August 2021; pp. 61–65. [Google Scholar]
  19. Bonaventura, P.; Howarth, P.; Menzel, W. Phonetic annotation of a non-native speech corpus. In Proceedings of the International Workshop on Integrating Speech Technology in the (Language) Learning and Assistive Interface, InStil; Abertay University: Dundee, UK, 2000; pp. 10–17. Available online: https://www.researchgate.net/profile/Patrizia-Bonaventura/publication/2812080_Phonetic_Annotation_of_a_Non-Native_Speech_Corpus/links/00b7d51b47a923c73d000000/Phonetic-Annotation-of-a-Non-Native-Speech-Corpus.pdf (accessed on 17 December 2022).
  20. Loewen, S. Introduction to Instructed Second Language Acquisition; Routledge: London, UK, 2014. [Google Scholar]
  21. Korzekwa, D.; Lorenzo-Trueba, J.; Drugman, T.; Calamaro, S.; Kostek, B. Weakly-supervised word-level pronunciation error detection in non-native English speech. arXiv 2021, arXiv:2106.03494. [Google Scholar]
  22. Lee, A.; Chen, N.F.; Glass, J. Personalized mispronunciation detection and diagnosis based on unsupervised error pattern discovery. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 6145–6149. [Google Scholar]
  23. Fu, K.; Lin, J.; Ke, D.; Xie, Y.; Zhang, J.; Lin, B. A full text-dependent end to end mispronunciation detection and diagnosis with easy data augmentation techniques. arXiv 2021, arXiv:2104.08428. [Google Scholar]
  24. Lee, A. Language-Independent Methods for Computer-Assisted Pronunciation Training. Ph.D Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2016. [Google Scholar]
  25. Korzekwa, D.; Lorenzo-Trueba, J.; Zaporowski, S.; Calamaro, S.; Drugman, T.; Kostek, B. Mispronunciation detection in non-native (L2) English with uncertainty modeling. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 7738–7742. [Google Scholar]
  26. Korzekwa, D.; Lorenzo-Trueba, J.; Drugman, T.; Kostek, B. Computer-assisted pronunciation training—Speech synthesis is almost all you need. Speech Commun. 2022, 142, 22–33. [Google Scholar] [CrossRef]
  27. Zhao, G.; Sonsaat, S.; Silpachai, A.; Lucic, I.; Chukharev-Hudilainen, E.; Levis, J.; Gutierrez-Osuna, R. L2-ARCTIC: A non-native English speech corpus. In Proceedings of the INTERSPEECH, Hyderabad, India, 2–6 September 2018; pp. 2783–2787. [Google Scholar]
  28. Zhang, J.; Zhang, Z.; Wang, Y.; Yan, Z.; Song, Q.; Huang, Y.; Li, K.; Povey, D.; Wang, Y. speechocean762: An open-source non-native english speech corpus for pronunciation assessment. arXiv 2021, arXiv:2104.01378. [Google Scholar]
  29. Loots, L.; Niesler, T. Automatic conversion between pronunciations of different English accents. Speech Commun. 2011, 53, 75–84. [Google Scholar] [CrossRef]
  30. Wadud, M.A.H.; Rakib, M.R.H. Text coherence analysis based on misspelling oblivious word embeddings and deep neural network. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 194–203. [Google Scholar] [CrossRef]
  31. Witt, S.M.; Young, S.J. Phone-level pronunciation scoring and assessment for interactive language learning. Speech Commun. 2000, 30, 95–108. [Google Scholar] [CrossRef]
  32. Mukhamadiyev, A.; Khujayarov, I.; Djuraev, O.; Cho, J. Automatic Speech Recognition Method Based on Deep Learning Approaches for Uzbek Language. Sensors 2022, 22, 3683. [Google Scholar] [CrossRef] [PubMed]
  33. Wadud, M.A.H.; Mridha, M.; Rahman, M.M. Word Embedding Methods for Word Representation in Deep Learning for Natural Language Processing. Iraqi J. Sci. 2022, 63, 1349–1361. [Google Scholar] [CrossRef]
  34. Wadud, M.A.H.; Mridha, M.; Shin, J.; Nur, K.; Saha, A.K. Deep-BERT: Transfer Learning for Classifying Multilingual Offensive Texts on Social Media. Comput. Syst. Sci. Eng. 2022, 44, 1775–1791. [Google Scholar] [CrossRef]
  35. Cheng, S.; Liu, Z.; Li, L.; Tang, Z.; Wang, D.; Zheng, T.F. Asr-free pronunciation assessment. arXiv 2020, arXiv:2005.11902. [Google Scholar]
  36. Baranwal, N.; Chilaka, S. Improved Mispronunciation detection system using a hybrid CTC-ATT based approach for L2 English speakers. arXiv 2022, arXiv:2201.10198. [Google Scholar]
  37. Ren, Z.; Yolwas, N.; Slamu, W.; Cao, R.; Wang, H. Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition. Sensors 2022, 22, 7319. [Google Scholar] [CrossRef]
  38. Mridha, M.F.; Wadud, M.A.H.; Hamid, M.A.; Monowar, M.M.; Abdullah-Al-Wadud, M.; Alamri, A. L-Boost: Identifying Offensive Texts From Social Media Post in Bengali. IEEE Access 2021, 9, 164681–164699. [Google Scholar] [CrossRef]
  39. Jeon, S.; Kim, M.S. End-to-End Sentence-Level Multi-View Lipreading Architecture with Spatial Attention Module Integrated Multiple CNNs and Cascaded Local Self-Attention-CTC. Sensors 2022, 22, 3597. [Google Scholar] [CrossRef]
  40. Wadud, M.A.H.; Kabir, M.M.; Mridha, M.; Ali, M.A.; Hamid, M.A.; Monowar, M.M. How can we manage Offensive Text in Social Media-A Text Classification Approach using LSTM-BOOST. Int. J. Inf. Manag. Data Insights 2022, 2, 100095. [Google Scholar] [CrossRef]
  41. Xu, Q.; Baevski, A.; Likhomanenko, T.; Tomasello, P.; Conneau, A.; Collobert, R.; Synnaeve, G.; Auli, M. Self-training and pre-training are complementary for speech recognition. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 3030–3034. [Google Scholar]
  42. Kahn, J.; Lee, A.; Hannun, A. Self-training for end-to-end speech recognition. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7084–7088. [Google Scholar]
  43. Park, D.S.; Zhang, Y.; Jia, Y.; Han, W.; Chiu, C.C.; Li, B.; Wu, Y.; Le, Q.V. Improved noisy student training for automatic speech recognition. arXiv 2020, arXiv:2005.09629. [Google Scholar]
  44. Xu, Q.; Likhomanenko, T.; Kahn, J.; Hannun, A.; Synnaeve, G.; Collobert, R. Iterative pseudo-labeling for speech recognition. arXiv 2020, arXiv:2005.09267. [Google Scholar]
  45. Higuchi, Y.; Moritz, N.; Roux, J.L.; Hori, T. Momentum pseudo-labeling for semi-supervised speech recognition. arXiv 2021, arXiv:2106.08922. [Google Scholar]
  46. Zhu, H.; Wang, L.; Hou, Y.; Wang, J.; Cheng, G.; Zhang, P.; Yan, Y. Wav2vec-s: Semi-supervised pre-training for speech recognition. arXiv 2021, arXiv:2110.04484. [Google Scholar]
  47. Keya, A.J.; Wadud, M.A.H.; Mridha, M.; Alatiyyah, M.; Hamid, M.A. AugFake-BERT: Handling Imbalance through Augmentation of Fake News Using BERT to Enhance the Performance of Fake News Classification. Appl. Sci. 2022, 12, 8398. [Google Scholar] [CrossRef]
  48. Wang, D.; Wei, Y.; Zhang, K.; Ji, D.; Wang, Y. Automatic Speech Recognition Performance Improvement for Mandarin Based on Optimizing Gain Control Strategy. Sensors 2022, 22, 3027. [Google Scholar] [CrossRef]
  49. Wang, H.W.; Yan, B.C.; Chiu, H.S.; Hsu, Y.C.; Chen, B. Exploring Non-Autoregressive End-to-End Neural Modeling for English Mispronunciation Detection and Diagnosis. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 6817–6821. [Google Scholar]
  50. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 5998–6008. [Google Scholar]
  51. Nguyen, T.Q.; Salazar, J. Transformers without tears: Improving the normalization of self-attention. arXiv 2019, arXiv:1910.05895. [Google Scholar]
  52. Garofolo, J.S. Timit Acoustic Phonetic Continuous Speech Corpus; Linguistic Data Consortium: Philadelphia, PA, USA, 1993. [Google Scholar]
  53. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
  54. Gong, Y.; Chen, Z.; Chu, I.H.; Chang, P.; Glass, J. Transformer-Based Multi-Aspect Multi-Granularity Non-Native English Speaker Pronunciation Assessment. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 7262–7266. [Google Scholar]
Figure 1. Workflow of the autoregressive model.
Figure 2. Workflow of the non-autoregressive model.
Figure 3. Step-by-step process of automatic pronunciation mistake identification.
Figure 4. Workflow of the dictation model for automatic pronunciation mistake identification.
Figure 5. Architecture of an attention block.
Figure 6. Pronunciation model working procedure.
Figure 7. Visualization of the effect of training and validation scores along with the number of epochs on different datasets using the FBANK feature extraction model. (a) L2-Arctic dataset. (b) SpeechOcean dataset.
Figure 8. Visualization of the effect of training and validation scores along with the number of epochs on different datasets using the wav2vec 2.0 (base) feature extraction model. (a) L2-Arctic dataset. (b) SpeechOcean dataset.
Figure 9. Visualization of the effect of training and validation scores along with the number of epochs on different datasets using the wav2vec 2.0 (large) feature extraction model. (a) L2-Arctic dataset. (b) SpeechOcean dataset.
Figure 10. Graphical view of the various attention head outputs at the last attention level of the encoder block. (a) Output of the first head. (b) Output of the second head. (c) Output of the third head. (d) Output of the fourth head.
Figure 11. Graphical view of the various attention head outputs at the last attention level of the PDS block. (a) Output of the first head. (b) Output of the second head. (c) Output of the third head. (d) Output of the fourth head.
Figure 12. Graphical view of the various attention head outputs at the last attention level of the decoder block. (a) Output of the first head. (b) Output of the second head. (c) Output of the third head. (d) Output of the fourth head.
Table 1. List of abbreviations used in this paper.
Terms | Description | Terms | Description
AAPM | Articulatory Acoustic-Phonemic Model | APM | Acoustic Phonological Model
AR | AutoRegressive | ASR | Automatic Speech Recognition
ARI | Adjusted Rand Index | CALL | Computer-Assisted Language Learning
CAPT | Computer-Assisted Pronunciation Training | DM | Dictation Model
DNN | Deep Neural Network | E2E | End-to-End
ERNs | Extended Recognition Networks | GMM | Gaussian Mixture Model
GoP | Goodness of Pronunciation | GOPT | Goodness Of Pronunciation feature-based Transformer
HMM | Hidden Markov Model | LASO | Listen Attentively and Spell Once
L2 | Second Language | MDD | Mispronunciation Detection and Diagnosis
MHSA | Multi-Head Self-Attention | MT-APM | Multi-Task APM
NAR | Non-AutoRegressive | NMI | Normalized Mutual Information
PDS | Position-Dependent Summarizer | PM | Pronunciation Model
S2S | Speech-to-Speech | T2S | Text-to-Speech
Table 2. Dataset summary.
Dataset Name | Categories | Speakers | Utterances | Hours
TIMIT | Train | 630 | 6300 | 4.5
L2-ARCTIC | Train | 17 | 2549 | 2.66
 | Test | 6 | 900 | 0.88
 | Dev | 1 | 150 | 0.12
SpeechOcean | Train | 125 | 2500 | 2.28
 | Test | 62 | 1250 | 1.14
 | Dev | 62 | 1250 | 1.14
Table 3. Comparison of phone error rate (%) among different baseline models with the proposed model on the L2-ARCTIC dataset.
Model Name | Phoneme Error Rate (PER) %
CNN-RNN-CTC | 27.75
Mask-CTC (W2V) | 15.9
Mask-CTC (FBANK) | 22.6
CTC-ATT (beam size = 10) | 14.9
CTC-ATT (beam size = 1) | 15.2
Proposed Model (FBANK) | 20.3
Proposed Model (W2V-base) | 14.7
Proposed Model (W2V-Large) | 14.1
Table 4. Results of most frequently misrecognized vowel and consonant.
Misrecognized Vowels
Models | iy | ih | eh | ae | ah | aa
Target Numbers | 2133 | 3970 | 1260 | 2155 | 5321 | 4874
CNN-RNN-CTC | 1097 | 1256 | 544 | 630 | 2035 | 231
Proposed Model | 1124 | 1487 | 720 | 780 | 2460 | 350
Misrecognized Consonants
Models | z | s | sh | t | dh | d
Target Numbers | 1160 | 4822 | 844 | 7833 | 1448 | 2420
CNN-RNN-CTC | 228 | 1457 | 313 | 1280 | 125 | 1067
Proposed Model | 340 | 1456 | 340 | 1500 | 180 | 1288
Table 5. Performance evaluation on correct pronunciation and incorrect pronunciation detection.
Model Names | CD: PR | CD: RE | CD: F1 | MD: PR | MD: RE | MD: F1
Mask-CTC (w/o PM) | 91.62 | 90.94 | 91.28 | 45.73 | 47.87 | 46.77
Mask-CTC (w/ PM) | 91.70 | 90.80 | 91.25 | 45.66 | 48.46 | 47.02
CTC-ATT | 91.04 | 92.24 | 91.73 | 47.55 | 42.94 | 45.13
GOP | 91.97 | 90.98 | 91.47 | 46.99 | 50.15 | 48.52
CNN-RNN-CTC | 93.88 | 79.97 | 86.37 | 34.88 | 67.29 | 45.94
Proposed Model (w/o PM) [L2-ARCTIC] | 91.93 | 92.95 | 91.63 | 47.82 | 48.93 | 48.01
Proposed Model (w/ PM) [L2-ARCTIC] | 92.03 | 92.45 | 91.82 | 47.95 | 49.32 | 48.65
Proposed Model (w/o PM) [SpeechOcean] | 93.45 | 92.95 | 92.67 | 55.21 | 50.58 | 52.07
Proposed Model (w/ PM) [SpeechOcean] | 94.28 | 93.62 | 93.01 | 55.85 | 52.06 | 51.74
(CD = correct-pronunciation detection; MD = mispronunciation detection; PR = precision; RE = recall; F1 = F1-score.)