#### **4. Experimental Results**

#### *4.1. Dialogue Exercise Result*

To validate the effectiveness of the proposed method, we performed computer-assisted fluency scoring experiments on spoken English sentences collected in dialogue scenarios of the GenieTutor system. Figure 7 shows an example of a role-play scenario and the fluency score feedback produced by the proposed method. Once the learner completes an utterance, the system computes several aspects of the pronunciation evaluation and displays them as diagrams. Learners can check the fluency scores of any sentence they select, and overall feedback is provided after all conversations are finished. As shown in Figure 7, the proposed method can compute the intonation curves and stress patterns of the sentences uttered by the learner even when pronunciation errors occur. In addition, mispronounced words are marked in red so that the learner can locate the erroneous parts.

**Figure 7.** Example of a dialogue exercise and the fluency scores of the learner and the native speaker with the proposed method: (**a**) example of a role-play dialogue exercise, (**b**) fluency score feedback.

#### *4.2. Proficiency Evaluation Test*

#### 4.2.1. Speech Database

We also performed a proficiency evaluation test using the rhythm and stress scores together with other fluency features. The speech data were selected from an English read-speech dataset recorded by non-native and native speakers for spoken proficiency assessment. The dataset is a corpus of English speech produced by Korean speakers and 7 American English native speakers (references) for experimental phonetics, phonology, and English education. It is designed to capture the intonation and rhythmic patterns of Korean speakers in English connected speech, as well as the segmental pronunciation errors that Korean speakers tend to make. Each utterance was scored by human expert raters on a scale of 1 to 5. In this study, gender and spoken-language proficiency levels were evenly distributed among the speakers. Table 3 shows sample scripts. The speech dataset comprised 100 non-native speakers; for each speaker, 80 sentences were used for training and another 20 sentences, not included in the training dataset, were used for testing.

**Table 3.** Samples of the scripts.


For speech conversion and augmentation, an additional 7 American English native speakers (3 males and 4 females) were used, with 100 sentences per speaker; frame alignment of the dataset was not performed. We used the WORLD package [49] for speech analysis. The sampling rate of all speech signals reported in this paper was 16 kHz, the frame shift was 5 ms, and the number of fast Fourier transform (FFT) points was 1024. From each extracted spectral sequence, 80 Mel-cepstral coefficients (MCEPs) were derived.
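As an illustration, the following sketch shows how such spectral features could be extracted with the pyworld wrapper of the WORLD vocoder and converted to MCEPs with pysptk; the input file name and the frequency-warping coefficient are assumptions, not values from the paper.

```python
# Sketch of WORLD-based feature extraction (pyworld/pysptk wrappers).
# The input file and the warping coefficient alpha are illustrative only.
import numpy as np
import soundfile as sf
import pyworld as pw
import pysptk

FS = 16000          # sampling rate reported in the paper
FRAME_PERIOD = 5.0  # frame shift in ms
FFT_SIZE = 1024     # number of FFT points
MCEP_ORDER = 79     # 80 Mel-cepstral coefficients = order + 1
ALPHA = 0.58        # assumed frequency-warping coefficient for 16 kHz

x, fs = sf.read("utterance.wav")      # hypothetical input file
x = np.ascontiguousarray(x, dtype=np.float64)
assert fs == FS

# WORLD analysis: F0 contour, spectral envelope, and aperiodicity
f0, t = pw.harvest(x, fs, frame_period=FRAME_PERIOD)
sp = pw.cheaptrick(x, f0, t, fs, fft_size=FFT_SIZE)
ap = pw.d4c(x, f0, t, fs, fft_size=FFT_SIZE)

# Convert the spectral envelope to 80-dimensional MCEPs
mcep = pysptk.sp2mc(sp, order=MCEP_ORDER, alpha=ALPHA)
print(mcep.shape)                     # (num_frames, 80)
```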

#### 4.2.2. Human Expert Rater

Each spoken English sentence uttered by the non-native learners was annotated by four human expert raters who have English teaching experience or currently work as English teachers. Each non-native utterance was rated in five proficiency areas: holistic impression of proficiency, intonation, stress and rhythm, speech rate and pause, and segmental accuracy. Each proficiency score was given on a fluency scale of 1–5. The holistic score of each utterance was calculated as the average of all proficiency scores and is used for proficiency evaluation in this paper. Table 4 shows the mean correlation between the human expert raters' holistic scores.

**Table 4.** Correlation between the human expert raters' holistic scores.

#### 4.2.3. Data Augmentation

The proposed VAE-based speech conversion model consisted of a content encoder, a style encoder, and a joint decoder. The content encoder comprised two dilated convolutional layers and a gated recurrent unit (GRU) based on a recurrent neural network. To remove the speech style information, all convolutional layers were followed by instance normalization (IN) [50]. The style encoder comprised a global average pooling layer, a 3-layer multi-layer perceptron (MLP), and a fully connected layer. In the style encoder, IN was not used because it would remove the feature mean and variance that represent the speech style information. The content and style factors were then fed into the decoder to reconstruct or convert the speech. The decoder comprised two dilated convolutional layers and a recurrent neural network-based GRU. All convolutional layers were followed by an adaptive instance normalization (AdaIN) layer whose parameters were generated by the MLP from the style factor [50]:

$$AdaIN(z,s) = \sigma(s)\left(\frac{z-\mu(z)}{\sigma(z)}\right) + \mu(s),\tag{7}$$

where *z* is the activation of the previous convolutional layer, *s* is the style factor, and *μ*(·) and *σ*(·) denote the channel-wise mean and standard deviation, respectively.
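To make the architecture and Equation (7) concrete, the sketch below gives one possible PyTorch reading of the content encoder, the style encoder, and the AdaIN operation. All channel counts, kernel sizes, and latent dimensions are assumptions; only the layer types follow the description above.

```python
# Illustrative PyTorch reading of the conversion model's encoders and of the
# AdaIN operation in Eq. (7). Layer sizes are assumptions, not from the paper.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Two dilated convolutions with instance normalization, then a GRU."""
    def __init__(self, in_dim=80, hidden=256, content_dim=128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=5, dilation=1, padding=2),
            nn.InstanceNorm1d(hidden),   # IN strips speaker/style statistics
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, dilation=2, padding=4),
            nn.InstanceNorm1d(hidden),
            nn.ReLU(),
        )
        self.gru = nn.GRU(hidden, content_dim, batch_first=True)

    def forward(self, mcep):                          # mcep: (B, T, 80)
        h = self.convs(mcep.transpose(1, 2)).transpose(1, 2)
        content, _ = self.gru(h)                      # (B, T, content_dim)
        return content

class StyleEncoder(nn.Module):
    """Global average pooling, a 3-layer MLP, and a fully connected output.
    No IN here, so the mean/variance carrying style information survive."""
    def __init__(self, in_dim=80, hidden=128, style_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.out = nn.Linear(hidden, style_dim)

    def forward(self, mcep):                          # mcep: (B, T, 80)
        pooled = mcep.mean(dim=1)                     # global average pooling over time
        return self.out(self.mlp(pooled))             # (B, style_dim)

def adain(z, style_mu, style_sigma, eps=1e-5):
    """Eq. (7): re-normalize decoder activations z of shape (B, C, T) with the
    statistics predicted by the MLP from the style factor (eps for stability)."""
    mu = z.mean(dim=2, keepdim=True)
    sigma = z.std(dim=2, keepdim=True)
    return style_sigma * (z - mu) / (sigma + eps) + style_mu
```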

Figure 8 shows an example of the Mel-spectrograms obtained by the proposed method. Comparing the decoding results, we confirmed that the proposed method reconstructs and converts the spectral features effectively.

**Figure 8.** Waveform and Mel-spectrograms. (**a**) Waveform of the input signal, (**b**) Mel-spectrogram of the input signal, (**c**) reconstructed Mel-spectrogram, and (**d**) converted Mel-spectrogram.

We performed a perception test to compare the sound quality and speaker similarity of the converted speech between the proposed VAE-based speech conversion method and the conventional conditional VAE-based speech conversion (CVAE-SC) method [29], one of the most common speech conversion methods. We conducted an AB test and an ABX test, where "A" and "B" were outputs from the proposed method and the CVAE-SC, and "X" was a real speech sample. To eliminate order bias, "A" and "B" were presented in random order. In the AB test, each listener was presented with the "A" and "B" audio clips and asked to select "A", "B", or "fair" by considering both speech naturalness and intelligibility. In the ABX test, each listener was presented with the two audio clips and a reference clip "X", and asked to select the one closer to the reference, or "fair". We used 24 utterance pairs for the AB test and another 24 utterance pairs, not included in the AB test, for the ABX test; 20 listeners participated. Figure 9 shows the results, which confirm that the proposed method outperforms the baseline in terms of both sound quality and speaker similarity.
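To show how such listening-test responses can be summarized, the short sketch below tallies preference votes into percentages and applies a two-sided sign test on the decisive (non-"fair") votes; the vote counts and the significance test are illustrative assumptions, not part of the paper's protocol.

```python
# Sketch for summarizing AB/ABX preference votes. Counts are made up, and the
# binomial sign test is an assumed analysis, not part of the paper's protocol.
from scipy.stats import binomtest

votes = {"proposed": 0, "baseline": 0, "fair": 0}
for choice in ["proposed", "fair", "proposed", "baseline", "proposed"]:  # example responses
    votes[choice] += 1

total = sum(votes.values())
for name, count in votes.items():
    print(f"{name}: {100.0 * count / total:.1f}%")

decisive = votes["proposed"] + votes["baseline"]     # ignore "fair" votes
p = binomtest(votes["proposed"], decisive, p=0.5).pvalue
print(f"sign-test p-value = {p:.3f}")
```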

We also performed a speech recognition test on the English read-speech dataset to validate that the spectral features were converted meaningfully. We used ESPnet [51] as the end-to-end ASR system. We trained one AM using only the original training dataset ("Train database only" in Table 5) and another AM using the augmented dataset ("Augmentation" in Table 5), and evaluated both on the same test dataset. Table 5 shows the word error rate (WER) results. For comparison, SpecAugment [21], the speed perturbation method [20], and CVAE-SC were used as references. As shown in Table 5, data augmentation with the proposed method improves the speech recognition accuracy for all proficiency score levels compared to the conventional AM and the other augmentation methods. By sampling different style factors, the proposed speech conversion method can generate diverse outputs, although its computational complexity is higher than that of the other methods.
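As a point of reference, the sketch below illustrates the kind of frequency and time masking applied by SpecAugment [21], one of the comparison methods; the mask counts and widths are assumed hyper-parameters, not values used in the paper's experiments.

```python
# Illustrative SpecAugment-style masking on a log-Mel feature matrix [21].
# Mask counts and widths are assumed hyper-parameters.
import numpy as np

def spec_augment(feats, num_freq_masks=2, max_freq_width=8,
                 num_time_masks=2, max_time_width=40, rng=None):
    """feats: (num_frames, num_mels) log-Mel features; returns a masked copy."""
    rng = rng or np.random.default_rng()
    out = feats.copy()
    num_frames, num_mels = out.shape
    for _ in range(num_freq_masks):                  # frequency masking
        width = int(rng.integers(0, max_freq_width + 1))
        start = int(rng.integers(0, max(1, num_mels - width)))
        out[:, start:start + width] = 0.0
    for _ in range(num_time_masks):                  # time masking
        width = int(rng.integers(0, max_time_width + 1))
        start = int(rng.integers(0, max(1, num_frames - width)))
        out[start:start + width, :] = 0.0
    return out
```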


**Table 5.** Speech augmentation and word error rate (%) results.

#### 4.2.4. Features for Proficiency Scoring

All features for proficiency scoring are computed based on the time-aligned phone sequence and its time information [11,12,14]. Table 6 shows the proficiency scoring feature list used to train the automatic proficiency scoring models in this work.
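To make the feature computation concrete, the sketch below derives a few typical timing-based features (articulation rate, pause statistics, phone-duration variability) from a time-aligned phone sequence; the feature names and the input format are hypothetical and do not reproduce the exact list in Table 6.

```python
# Hypothetical example: timing-based scoring features from a time-aligned
# phone sequence (phone, start, end in seconds). Feature names and the input
# format are illustrative; they do not reproduce the exact list in Table 6.
import numpy as np

aligned_phones = [                        # output of the ASR time alignment
    ("sil", 0.00, 0.32), ("HH", 0.32, 0.40), ("AH", 0.40, 0.51),
    ("L", 0.51, 0.58), ("OW", 0.58, 0.80), ("sil", 0.80, 1.10),
]

speech = [(p, e - s) for p, s, e in aligned_phones if p != "sil"]
pauses = [e - s for p, s, e in aligned_phones if p == "sil"]
total_time = aligned_phones[-1][2] - aligned_phones[0][1]

features = {
    "articulation_rate": len(speech) / sum(d for _, d in speech),  # phones per second of speech
    "mean_pause_duration": float(np.mean(pauses)) if pauses else 0.0,
    "pause_ratio": sum(pauses) / total_time,
    "phone_duration_std": float(np.std([d for _, d in speech])),
}
print(features)
```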



**Table 6.** List of features for proficiency scoring.


#### 4.2.5. Proficiency Scoring Model

To train scoring models with high agreement with the human expert raters, we used two modeling methods: (1) multiple linear regression (MLR) and (2) a deep neural network. MLR is simple and has long been used for automatic proficiency scoring. With the MLR scoring model, the proficiency score is computed as follows:

$$Score = \sum_{i} \alpha_{i} \cdot f_{i} + \beta, \tag{8}$$

where *i* is the index of each feature, *αi* is the weight associated with each scoring feature *fi*, and *β* is a constant intercept.
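A minimal sketch of fitting such a linear scoring model with scikit-learn is shown below, assuming a feature matrix of the 41 scoring features and the averaged human holistic scores as targets; the array contents are placeholders, not data from the paper.

```python
# Sketch of the MLR scoring model of Eq. (8) using scikit-learn.
# X_train (utterances x 41 features) and y_train (human holistic scores)
# are placeholders standing in for the prepared scoring data.
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.random.rand(80, 41)          # placeholder feature matrix
y_train = np.random.uniform(1, 5, 80)     # placeholder holistic scores (1-5)

mlr = LinearRegression()
mlr.fit(X_train, y_train)                 # learns the weights alpha_i and the intercept beta
predicted = mlr.predict(X_train[:5])
print(np.clip(predicted, 1.0, 5.0))       # keep predicted scores inside the 1-5 range
```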

We also used a neural network to train the proficiency scoring model nonlinearly and more accurately. The neural network comprised a convolutional layer with one hidden layer of three hidden units, followed by a fully connected layer. Given the 41 features, the neural network was trained to predict the proficiency score.
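One possible reading of this architecture is sketched below, under the assumption that the convolution is a 1-D convolution with three channels applied over the 41-dimensional feature vector; this interpretation and all layer sizes are assumptions, not a definitive reconstruction of the authors' model.

```python
# One possible reading of the scoring network: a 1-D convolution with three
# channels over the 41-feature vector followed by a fully connected output.
# The interpretation and all layer sizes are assumptions.
import torch
import torch.nn as nn

class ScoringNet(nn.Module):
    def __init__(self, num_features=41, channels=3, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, channels, kernel_size=kernel_size, padding=1)
        self.fc = nn.Linear(channels * num_features, 1)

    def forward(self, x):                            # x: (B, 41)
        h = torch.relu(self.conv(x.unsqueeze(1)))    # (B, 3, 41)
        return self.fc(h.flatten(1)).squeeze(1)      # predicted proficiency score

model = ScoringNet()
score = model(torch.rand(4, 41))
print(score.shape)                                   # torch.Size([4])
```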

#### 4.2.6. Proficiency Evaluation Results

To validate that the proposed automatic proficiency evaluation system measures proficiency scores effectively and meaningfully, we computed Pearson's correlation coefficient between the proficiency scores of the proposed system and those of the human raters. Pearson's correlation coefficient is a commonly used metric for evaluating the performance of proficiency assessment methods [52–54]. Tables 7 and 8 show the proficiency evaluation results obtained by the proposed method without and with data augmentation, respectively. For comparison, the range of correlation coefficients of the inter-rater scores ("Human" in Tables 7 and 8) was used as a reference. As shown in Tables 7 and 8, the proposed automatic proficiency evaluation method measures proficiency scores effectively for all proficiency areas. Moreover, augmenting the AM training data with the proposed speech conversion method improves the averaged correlation for all proficiency areas compared to the method employing a conventional AM trained without data augmentation. By automatically evaluating the proficiency of an L2 speaker's utterance, the proposed proficiency scoring system can provide fast and consistent evaluation in various environments.
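For completeness, a minimal sketch of computing this agreement measure with scipy is given below, assuming the machine and human scores are stored in two parallel arrays; the values are placeholders, not results from the paper.

```python
# Sketch of the evaluation metric: Pearson's correlation between the system's
# proficiency scores and the human raters' scores (values are placeholders).
import numpy as np
from scipy.stats import pearsonr

machine_scores = np.array([3.2, 4.1, 2.5, 4.8, 3.7])
human_scores   = np.array([3.0, 4.5, 2.8, 4.6, 3.5])

r, p_value = pearsonr(machine_scores, human_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```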

**Table 7.** Correlation between human rater and proposed proficiency scoring system without data augmentation.



**Table 8.** Correlation between human rater and proposed proficiency scoring system with data augmentation.

#### **5. Conclusions and Future Work**

We proposed an automatic proficiency evaluation method for L2 learners of spoken English. In the proposed method, we augmented the training dataset using the VAE-based speech conversion model and trained the acoustic model (AM) with the augmented training dataset to improve the speech recognition accuracy and time-alignment performance for non-native speakers. After recognizing the speech uttered by the learner, the proposed method measured various fluency features and evaluated the proficiency. To compute the stress and rhythm scores even when phonemic sequence errors occur in the learner's speech, the proposed method aligned the phonemic sequences of the spoken English sentences using DTW and then computed the error-tagged stress patterns and the stress and rhythm scores. In computer experiments with the English read-speech dataset, we showed that the proposed method effectively computed the error-tagged stress patterns, stress scores, and rhythm scores. Moreover, we showed that the proposed method measured proficiency scores effectively and improved the averaged correlation between the human expert raters and the proposed method for all proficiency areas compared to the method employing a conventional AM trained without data augmentation.

The proposed framework can also be applied to other signal processing and generation problems, such as sound conversion between instruments or image generation. However, the current style conversion framework has a limitation: the conversion model learns domain-level style factors and generates the converted speech accordingly, rather than capturing the diverse pronunciation styles of the individual speakers within each domain. To learn more meaningful and diverse style factors and to perform many-to-many speech conversion, we plan to address automatic speaker label estimation and extension to speaker-specific style encoders in future work.

**Author Contributions:** Conceptualization, methodology, validation, formal analysis, writing—original draft preparation, and writing—review and editing, Y.K.L.; supervision and project administration, J.G.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by an Electronics and Telecommunications Research Institute (ETRI) grant funded by the Korean government (21ZS1100, Core Technology Research for Self-Improving Integrated Artificial Intelligence System), and by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean Government (MSIT) (2019-0-00004, Development of semi-supervised learning language intelligence technology and Korean tutoring service for foreigners).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.
