Article

An Investigation of Applying Large Language Models to Spoken Language Learning

1 School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
2 Department of Linguistics, The University of Texas at Austin, Austin, TX 78712, USA
3 NetEase Youdao, Beijing 100193, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(1), 224; https://doi.org/10.3390/app14010224
Submission received: 20 November 2023 / Revised: 21 December 2023 / Accepted: 23 December 2023 / Published: 26 December 2023
(This article belongs to the Special Issue Natural Language Processing: Novel Methods and Applications)

Abstract

People have long desired intelligent conversational systems that can provide assistance in practical scenarios. The latest advancements in large language models (LLMs) are making significant strides toward turning this aspiration into a tangible reality. LLMs are believed to hold the most potential and value in education, especially in the creation of AI-driven virtual teachers that facilitate language learning. This study focuses on assessing the effectiveness of LLMs within the educational domain, specifically in the areas of spoken language learning, which encompass phonetics, phonology, and second language acquisition. To this end, we first introduced a new multiple-choice question dataset to evaluate the effectiveness of LLMs in the aforementioned scenarios, including the understanding and application of spoken language knowledge. Moreover, we investigated the influence of various prompting techniques, such as zero- and few-shot methods (prepending the question with question–answer exemplars), chain-of-thought (CoT) prompting, in-domain exemplars, and external tools. We conducted a comprehensive evaluation of popular LLMs (20 distinct models) using these methods. The experimental results showed that the task of extracting conceptual knowledge posed few challenges for these LLMs, whereas answering application questions was considerably more difficult. In addition, some widely proven prompting methods, combined with domain-specific examples, yielded significant performance improvements over the zero-shot baselines. Further preliminary experiments revealed the strengths and weaknesses of different LLMs. The findings of this study can shed light on the application of LLMs to spoken language learning.

1. Introduction

Spoken language learning is the process of acquiring the ability to communicate verbally in a new language. It involves developing skills such as listening, speaking, pronunciation, vocabulary, grammar, and discourse [1]. Spoken language learning can be accomplished through diverse methods, including formal instruction, self-study, immersion, interaction, and technology. One approach to integrating technology into spoken language learning is computer-assisted pronunciation training (CAPT), a subfield of computer-assisted language learning (CALL) [2,3]. It mainly involves the assessment of pronunciation errors and the detection and correction of prosody errors. Most current systems focus on the pronunciation of phonemes. Evaluating prosody has always been a challenging task, since precise prosody can only be comprehended by grasping the context, which was beyond the capability of previous models [4]. To help language learners develop their spoken language skills, it is necessary to have an open and unconstrained language learning system that allows them to express themselves freely. For example, the content for pronunciation training need not be pre-set, allowing users the freedom to choose and switch between topics at their leisure.
The highly parallelizable Transformer architecture [5], combined with massively parallel computing hardware and self-supervised learning techniques, makes it possible to leverage vast amounts of raw data (e.g., text, images, audio) to learn general-purpose deep contextualized representations [5,6,7]. These pre-trained context-aware representations are now ubiquitous in natural language processing (NLP) and are very effective as general-purpose semantic features, which have largely improved the performance of NLP tasks [8,9]. Large language models (LLMs) are poised to make the aforementioned vision a reality due to their ability to evaluate the emotion, prosody, and even rhythm of a sentence in any given context.

1.1. Related Works

Large Language Models (LLMs): Language models have revolutionized NLP in recent years. Researchers have found that enlarging a pre-trained language model (PLM) (e.g., its parameter count) often leads to better performance [10,11]. A number of studies have explored pushing the limits of performance by training ever larger PLMs (e.g., the 175 B-parameter GPT-3 [12] and the 540 B-parameter PaLM [13]). A notable success story among LLMs is ChatGPT (https://openai.com/blog/chatgpt/, accessed on 11 August 2023), developed by OpenAI. It adapts LLMs from the GPT series for dialogue and presents remarkable conversational abilities. This triggered community-wide enthusiasm and spurred a wave of imaginative applications. Microsoft AI scientists have recently explored the capabilities of OpenAI’s GPT-4 [14], a more powerful large language model, claiming that GPT-4 demonstrates “sparks” of human-level intelligence or artificial general intelligence (AGI) [15]. There have been a number of attempts to assess ChatGPT and other LLMs from different angles, including natural language tasks, reasoning, robustness, trustworthiness, ethical considerations, and specific applications (e.g., natural science, social science, engineering, medicine, education, etc.) [16,17,18,19,20,21,22]. The broad applicability of LLMs underscores the need to evaluate their emerging intelligence within expert domains.
Spoken Language Intelligence (SLI): Our world is inherently multimodal, and we engage with our environment through a variety of mediums, including text, images, sounds, and sensory experiences. Numerous multimodal methods exhibit exceptional problem-solving capabilities across various scenarios, with text-based semantics frequently serving as the key agent, either explicitly or implicitly [23,24]. For example, we aspire to empower machines to comprehend an image or video and articulate its meaning in words [25]. Conversely, we craft specialized text prompts to generate images with the creativity of professional painters [26]. In the realm of speech, automatic speech recognition (ASR) technology can extract the corresponding text from speech signals, even in complex speaking scenarios. Conversely, Text-to-Speech (TTS) systems can realistically produce speech sounds that correspond to a given piece of text, mimicking the nuances of human speech. The powerful capabilities of LLMs make it easy to integrate these systems, for example, to build voice assistants. Owing to the shared Transformer architecture, such systems can be combined not only through simple cascading but also in a fully end-to-end manner [27]. LLMs have the ability to replace the LM module in ASR, decode discretized representations directly, or even replace the intricate TTS text frontend, resulting in more expressive speech generation [28]. Recently, AudioPaLM [29] was created by fusing text- and speech-based language models. It inherited the capability to preserve paralinguistic information and intonation from AudioLM [30] and the linguistic knowledge present in the text LLM PaLM-2 [31]. It outperformed existing systems in speech translation tasks and demonstrated the capability to conduct zero-shot speech-to-text translation across numerous languages.
Prompt Engineering: Prompts provide a natural and intuitive way for humans to interact with LLMs, allowing users to design and supply tailored prompts to guide LLMs in producing preferred responses or accomplishing specific tasks. A typical prompting method is In-Context Learning (ICL) [12], which utilizes natural language text to formulate task descriptions and/or demonstrations, enabling LLMs to recognize and perform new tasks by learning from a few given examples. To further improve ICL, chain-of-thought (CoT) prompting [32] incorporates a sequence of intermediate reasoning steps into prompts. Rather than merely formulating the prompts with input-output pairs as in ICL, CoT incorporates intermediate reasoning steps that guide the LLM’s reasoning process from the input to the final output within the prompts. Once appropriately designed, CoT prompts can effectively stimulate the reasoning skills of LLMs. In fact, using diverse CoTs (i.e., sampling multiple reasoning paths for one problem) has been shown to be a direct and effective approach for enhancing their performance [33]. Moreover, LLMs are not limited to their internal knowledge and can use external tools when needed. Previous research has demonstrated that the use of API calls to integrate various tools, such as search engines, calculators, and compilers, improves the performance of LLMs in specific tasks [34,35,36].
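To make the contrast between plain ICL and CoT prompting concrete, the following minimal sketch assembles both prompt styles for a multiple-choice question. The exemplar text and the build_prompt helper are our own illustrative assumptions, not templates from any cited work (the templates actually used in this study appear later in Table 1).

```python
# A minimal, illustrative sketch of assembling ICL and CoT prompts for a
# multiple-choice question; exemplar text and helper names are hypothetical.

icl_exemplar = (
    "Question: Which syllable of 'record' (noun) is stressed? "
    "(A) the first (B) the second (C) both (D) neither\n"
    "Answer: A"
)

cot_exemplar = (
    "Question: Which syllable of 'record' (noun) is stressed? "
    "(A) the first (B) the second (C) both (D) neither\n"
    "Answer: Let's think step by step. As a noun, 'record' follows the "
    "English noun-verb stress alternation, so the first syllable is "
    "stressed. Therefore, the answer is A"
)

def build_prompt(exemplars, question, cot=False):
    """Concatenate exemplar blocks and the test question into one prompt."""
    parts = list(exemplars) + [f"Question: {question}"]
    # CoT cues the model to produce intermediate reasoning before answering.
    parts.append("Answer: Let's think step by step." if cot else "Answer:")
    return "\n\n".join(parts)

question = ("In 'I wanted the RED car', which word carries the sentence "
            "stress? (A) I (B) wanted (C) red (D) car")
print(build_prompt([icl_exemplar], question))            # plain ICL
print(build_prompt([cot_exemplar], question, cot=True))  # chain of thought
```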
Deploying LLMs: For practical applications of LLMs, it is essential to incorporate additional safeguards. They are likely to magnify social biases ingrained in their training data and may produce inaccurate, toxic, biased, or even harmful information [37]. In reality, similar issues have been observed in various language evaluation scenarios, such as native-speaker judgment of the speech of non-native speakers, which is notoriously biased. This phenomenon is known as reverse linguistic stereotyping, whereby a person’s speaking performance is evaluated based on the stereotypes associated with their social identity [38]. Therefore, deploying LLMs in sensitive areas such as education must be approached with great care [39,40]. While LLMs have been evaluated on a variety of benchmarks, such as MMLU [17] and BIG-bench [41], further studies are necessary to apply them to the domain of spoken language learning.
We acknowledge that LLMs can assist CALL by helping language teachers create lesson materials or providing students with explanations of grammar and vocabulary in an adaptive and personalized manner [39]. They also have the potential to exhibit spoken language intelligence (SLI), i.e., the ability of a system or technology to understand and process spoken language by leveraging speech recognition, NLP, or conversational AI. These technologies can enable machines to comprehend, interpret, and respond to spoken language in a manner that resembles human-like intelligence. However, it is still questionable whether their capabilities can match the expertise of phonetics specialists or serve as effective language learning assessors. To begin exploring this question, we propose a challenging sub-question: Do LLMs possess adequate spoken language intelligence to handle reasoning questions that require the expertise of human phonetic professionals?

1.2. The Contributions of This Study

This paper investigates the performance, interpretability, and limitations of LLMs in SLI within the field of spoken language learning. We collected a composite dataset comprising a set of concept questions mainly designed to test the large models’ knowledge of spoken language, along with application questions tailored for industrial production. We examine a series of LLMs in a study conducted in two rounds. In the first round, we comprehensively review their performance on a large scale, considering two prompting strategies (direct and CoT prompting) under both the zero- and few-shot learning paradigms. In the second round, we meticulously analyze representative models using advanced prompting methods. The main contributions of this study are as follows:
  • This study introduces a dataset on spoken language intelligence that serves as a substantial benchmark for speech and language learning scenarios, providing valuable supplementation to existing benchmarks.
  • This study investigates various prompting strategies (such as zero-shot, few-shot, direct/CoT, domain-specific exemplars, and external tools) and analyzes their performance on multiple-choice questions.
  • This study demonstrates that in-domain example sampling techniques can consistently improve performance on domain-specific data.
  • This study conducts an expert evaluation of a small set of multi-turn conversations generated by GPT-3.5. Unless otherwise specified, the results mentioned in this paper regarding GPT-3.5 are based on the GPT-3.5-turbo-0613 model. Error analysis indicates that GPT-3.5 has the potential to enhance conversational spoken language learning.

2. Methodology

2.1. Dataset

Even though benchmarks for general abilities and some professional fields are widely used, there is still a scarcity of evaluation datasets on SLI for language learning. We introduce a new dataset called SLIQ-LL (Spoken Language Intelligence Questions for Language Learning) (available on Github: https://github.com/vocaliodmiku/SLI-LL, accessed on 11 August 2023). This dataset covers the topics of phonetics, phonology, and second language acquisition, which are frequently addressed in language learning education. The dataset consists of two subsets:
  • Knowledge and Concept: We designed a set of concept-related questions to test the large models’ knowledge of spoken languages, such as “What is language transfer?” and “How many classifications of consonants are there for the manner of articulation?” These questions were mainly sampled from the exercises at the end of each chapter of the book [42] and then manually adjusted to the needs of the research.
  • Application Questions: Addressing the ever-changing nature of personalized problems requires applying knowledge of phonetics or linguistics to perform complex reasoning. For example, different contexts call for different appropriate stress patterns, which is fundamental to an automatic personalized language learning system. Therefore, based on language teaching practices involving pronunciation, pauses, stress, and intonation, we manually designed a series of representative questions. For example, regarding stress, we formulated questions about the placement of word stress within words, phrase stress when conveying specific meanings, and sentence stress. For the issue of word stress within words, we considered having GPT determine the positions of stressed syllables in both regular and irregular words (e.g., the word “present” has different stress locations depending on the part of speech). For the issue of intonation, we designed questions that require the LLMs to determine the intended meanings of sentences with rising/falling tones in different positions and, conversely, to determine the proper intonation for sentences with specific meanings. Overall, we aimed to increase the breadth of the questions and enhance their practical relevance as much as possible. An example about pronunciation is shown in Figure 1.
We collected and designed a total of 445 questions, including 144 knowledge and concept-related questions and 301 application questions. Figure 2 (left) shows the type distribution of the application questions. During our initial preparations, we found that, in the context of language learning, intermediate-level learners need to focus more on fundamental pronunciation issues than on suprasegmental features like stress, break, and intonation. To present such data, we formulated multiple-choice questions, each with only one correct answer. We distributed the correct answers uniformly across the four options; Figure 2 (right) shows the statistics for these questions.
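For concreteness, one item in the Application Questions subset could be represented as in the following sketch; the field names are hypothetical and are not necessarily the schema of the released SLIQ-LL files.

```python
# Hypothetical representation of one multiple-choice item; the exact field
# names in the released SLIQ-LL files may differ.
example_item = {
    "subset": "application",          # "knowledge_concept" or "application"
    "type": "stress",                 # pronunciation / stress / break / intonation
    "question": "In 'present' used as a verb, which syllable is stressed?",
    "options": {"A": "the first", "B": "the second",
                "C": "both equally", "D": "neither"},
    "answer": "B",                    # exactly one correct option
}
```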

2.2. Prompting Methods

This paper delves into various prompt engineering methods for spoken language question answering, with the goal of attaining a thorough grasp of the genuine performance of diverse models. A basic introduction, a comparison, and various applications of different prompting methods can be found in [44]. The prompt templates we created in this study are summarized in Table 1.
Zero-shot setting: We adopt the zero-shot setting as the benchmark for our experiments. This approach involves simply posing a question and requesting an answer, which is the most straightforward and widely used way to leverage LLMs. However, a foundation model that has not been fine-tuned on any task or instruction set, particularly one with a limited parameter size, may not generate meaningful responses to user inputs. Nonetheless, as our test data are presented in the form of multiple-choice questions, which aligns with many LLM benchmarks [17,45,46], this technique is relatively user-friendly and practical [47].
Few-shot setting: We insert multiple exemplars between the task description and the test question; each exemplar is a question–answer pair structured as in the zero-shot setting.
CoT: We incorporate reasoning chains into the prompts by appending “Let’s think step by step. <CoT>” to the prompt in the zero- or few-shot setting, followed by a statement to draw the conclusion: “Therefore, the answer is <answer>”. Thus, two additional settings can be constructed: zero-shot CoT and few-shot CoT.
In-domain exemplar: When providing exemplars for each question, we select the most relevant ones based on the question type, as sketched below. This operation is applied to the few-shot CoT setting.
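A minimal sketch of this selection step, assuming a hand-built pool of CoT exemplars keyed by question type (the pool contents and function name are illustrative):

```python
import random

# Hypothetical pool: CoT exemplars grouped by question type.
EXEMPLAR_POOL = {
    "pronunciation": ["<pronunciation exemplar 1>", "<pronunciation exemplar 2>"],
    "stress": ["<stress exemplar 1>", "<stress exemplar 2>", "<stress exemplar 3>"],
    "break": ["<break exemplar 1>", "<break exemplar 2>"],
    "intonation": ["<intonation exemplar 1>", "<intonation exemplar 2>"],
}

def select_in_domain_exemplars(question_type, k=3):
    """Sample up to k exemplars of the same type as the test question."""
    pool = EXEMPLAR_POOL[question_type]
    return random.sample(pool, min(k, len(pool)))
```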
Self-consistency: CoT prompting offers more options to deduce the answer to a given question, with a primary focus on generating multiple lines of reasoning and striving to find a consensus among the resulting answers (such as selecting the most consistent answer through voting among these paths) [33,48]. The work in [49] showed that diverse reasoning paths are a critical factor in improving CoT reasoning performance. The integration of self-consistency into CoT prompting can readily improve performance with no need for additional training. Here, we utilize multiple answers generated by CoT prompting to obtain self-consistency results through majority voting.
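Operationally, self-consistency amounts to sampling several CoT completions at a non-zero temperature, extracting the final option letter from each, and majority voting. Below is a minimal sketch assuming a generic generate callable; the regular expression targets the “Therefore, the answer is …” conclusion format used in our prompts.

```python
import re
from collections import Counter

def extract_choice(completion):
    """Pull the final option letter (A-D) out of a CoT completion."""
    match = re.search(r"the answer is\s*\(?([A-D])\)?", completion, re.IGNORECASE)
    return match.group(1).upper() if match else None

def self_consistent_answer(generate, prompt, n_paths=10):
    """Sample diverse reasoning paths and return the majority-voted answer.

    `generate` is assumed to be any callable returning one sampled
    completion; temperature > 0 is what makes the paths diverse.
    """
    votes = Counter()
    for _ in range(n_paths):
        choice = extract_choice(generate(prompt, temperature=0.7))
        if choice is not None:
            votes[choice] += 1
    return votes.most_common(1)[0][0] if votes else None
```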
Tool augmentation: Although LLMs memorize some of the knowledge ingrained in their training data, they may still struggle to utilize relevant external knowledge efficiently during inference. To enhance the performance of language models, a branch of research has explored the use of external tools and the retrieval of relevant information from knowledge bases [34,35,36,50,51]. We utilize Google and Wikipedia as external tools to assist in answering the questions. We use the langchain toolkit (https://github.com/langchain-ai/langchain, accessed on 11 August 2023) to conduct the experiment, and this strategy is only examined in the zero-shot learning setting.
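For reference, the tool-augmented setting can be reproduced roughly as follows with the 2023-era langchain API (module paths and agent names have since changed across langchain releases, so treat this as a sketch rather than a drop-in script; it also assumes valid OpenAI and SerpAPI credentials in the environment):

```python
# Sketch of a zero-shot, tool-using agent; assumes the 2023-era langchain API.
from langchain.agents import initialize_agent, load_tools, AgentType
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo-0613", temperature=0)

# "serpapi" wraps Google search; "wikipedia" queries the Wikipedia API.
tools = load_tools(["serpapi", "wikipedia"], llm=llm)

agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)

question = ("Which syllable of 'photography' is stressed? "
            "(A) the first (B) the second (C) the third (D) the fourth")
print(agent.run(question))
```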

3. Experiments and Results

3.1. Experimental Setup

We evaluated the performance of 20 different LLMs on the SLIQ-LL dataset, covering four model families: GPT-3.5 and GPT-4 [14] (the API version used here is GPT-4-0613), LLaMA 1 and LLaMA 2 [52,53], FLAN-T5 and UL2 [54,55], and Pythia [56]. Table 2 shows the details of the LLMs used in this study.
We conducted our experiments in two rounds. In the first round, we performed a comprehensive performance comparison on a large scale, considering two prompting strategies, direct and CoT prompting, under both the zero- and few-shot (three-shot) learning paradigms. In the second round, we carefully analyzed the representative models using advanced prompting methods. All reasoning was performed using the parameters {max_new_tokens: 512, temperature: 0, top_p: 1} to obtain the most deterministic results. We wrote five development examples with thought chains for the concept-related and application questions, respectively; unless otherwise specified, the few-shot exemplars were sampled from these five examples.
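As an illustration of this decoding setup, a single query through the 2023-era OpenAI Python API might look as follows (the paper's max_new_tokens corresponds to max_tokens in this API; for the open-source models, the analogous Hugging Face arguments would be max_new_tokens=512 with greedy decoding). The system message mirrors the task description in Table 1.

```python
# Sketch of a deterministic query using the 2023-era OpenAI Python API
# (openai < 1.0); the question text is a placeholder.
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[
        {"role": "system", "content": "You are an expert in phonetics, "
         "English phonology, and second language acquisition. Here is a "
         "multiple-choice question for you to answer correctly."},
        {"role": "user", "content": "Question: ...\nAnswer:"},
    ],
    max_tokens=512,   # max_new_tokens: 512
    temperature=0,    # greedy decoding for the most deterministic results
    top_p=1,
)
print(response["choices"][0]["message"]["content"])
```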

3.2. Zero-Shot and Few-Shot Benchmarks

Table 3 presents the full accuracy results of the first round, in which we compared only the zero-shot, few-shot (k = 3), and CoT configurations. Accuracy was calculated by counting the number of correct answers given by an LLM and dividing it by the total number of questions. The Δ column indicates the performance gap between a specific prompting method and the zero-shot configuration. The values in the “overall” column were calculated by dividing the total number of correct answers by the total number of questions regardless of the question type. For convenience of observation, Figure 3 illustrates each model’s ability by showing its best performance among the four prompting methods, i.e., the highest accuracy selected from the last column of Table 3. It should be noted that the results in the figure are the raw results minus 25% (the random selection level). Overall, models with larger parameter sizes exhibited better overall performance. Despite this, Pythia struggled to generate reasonable responses, and the performance of its best prompting strategy was still below the level of random guessing. In contrast, GPT-4 displayed exceptional performance with a significant advantage. Surprisingly, Flan-T5 exhibited performance comparable to models several times its parameter size, and the 70 B version of LLaMA2 also achieved performance comparable to GPT-3.5. The best performance among the open-source models was achieved by LLaMA2-70B with an accuracy of 64%, highlighting the potential for further optimization. In addition to model size, the scale and diversity of the pre-training data also had a significant impact on performance. As shown in Figure 3, LLaMA2-70B exhibited relatively better performance compared to the other LLMs, except for the GPT-3.5 and GPT-4 models. This is probably because it was pre-trained on 2.0 T tokens of data, which is much larger than the pre-training corpora of the other models. Notably, however, some LLMs pre-trained on less data still exhibited better performance. The better performance obtained by GPT-3.5 is probably due to its relatively large model size and the diversity of its training data, although it was pre-trained on only 300 B tokens. The model size and pre-training data scale of GPT-4 are unavailable, but it was fine-tuned using instruction tuning (IT) and reinforcement learning with human feedback (RLHF). It is therefore reasonable to speculate that optimal performance is achieved through a combination of model size, the scale and diversity of pre-training data, subsequent fine-tuning, and other related factors. To gain a more precise understanding of these models’ performance, we conducted a more in-depth analysis of various aspects, as outlined below.
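The accuracy bookkeeping described above is a small computation; the sketch below uses made-up counts purely for illustration (they are not values from Table 3):

```python
def accuracy(correct, total):
    """Accuracy in percent."""
    return 100.0 * correct / total

# Illustrative counts for one model under one prompting method.
concept = {"correct": 115, "total": 144}
applied = {"correct": 148, "total": 301}

# "Overall" pools the correct answers, rather than averaging subset scores.
overall = accuracy(concept["correct"] + applied["correct"],
                   concept["total"] + applied["total"])

zero_shot_overall = 52.0              # hypothetical zero-shot baseline
delta = overall - zero_shot_overall   # the Δ column in Table 3
print(f"overall = {overall:.1f}%, Δ = {delta:+.1f}")
```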
Performance and Stability: Models with more parameters tended to exhibit better performance and stability. To examine their reliability and stability, we first grouped the large language models into five groups in terms of their model parameters and then drew the box plots separately for the “Knowledge & Concept” and “Application Questions” subsets, which are plotted in Figure 4. The values used to create the box plots were taken from the “Concept (144)” and “Applied Questions (301)” columns in Table 3, respectively. As we examine the two data subsets more closely (as depicted in Figure 4), we can draw similar conclusions as above. Notably, regarding the Knowledge and Concepts subset, models with parameter sizes exceeding 20B displayed reduced performance variation, signifying heightened stability in their performance.
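A grouped box plot of this kind can be drawn as in the following sketch; the five size bins and the per-model accuracy values are illustrative assumptions, not the actual groups or results from Table 3:

```python
import matplotlib.pyplot as plt

# Hypothetical grouping: per-model accuracies (%) binned by parameter count.
groups = {
    "<7B": [28.0, 31.5],
    "7-13B": [35.2, 41.0, 44.3],
    "13-20B": [47.1, 50.6],
    "20-70B": [52.4, 55.8],
    ">=70B": [58.9, 64.0],
}

fig, ax = plt.subplots()
ax.boxplot(groups.values(), labels=list(groups.keys()))  # one box per size group
ax.set_xlabel("Model size group")
ax.set_ylabel("Accuracy (%)")
plt.show()
```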
Concept Memorization and Knowledge Reasoning: The LLMs excelled at concept memorization but exhibited a weaker ability to apply knowledge for reasoning. Even for a relatively small model (7 B), the accuracy of concept memorization reached nearly 80%, and a model of around 11 B reached the level of GPT-3.5. At around 20 B, the LLMs reached performance saturation on the knowledge and concept-related questions. However, in reasoning about the application questions, even the most powerful 70 B LLaMA2 models, as well as the GPT series models, performed relatively poorly, achieving low accuracy (42.6% and 64.3% on average, respectively).
Knowledge Preference: We analyzed whether these models exhibited any specific preferences for certain types of knowledge. Figure 5 demonstrates that these models exhibited no significant differences in terms of accuracy among the different types of questions.
Answer Bias: We selected several models and analyzed the distribution of their generated answers, as shown in Table 4. It can be observed that, apart from GPT-3.5 and GPT-4, the models exhibited apparent answer bias. It is well accepted that LLMs are likely to generate toxic, biased, or even harmful content because their training procedures mainly aim to capture the characteristics of the pre-training datasets. Among the LLMs used in this study, GPT-3.5 was fine-tuned with RLHF, enabling it to better follow human expectations. Therefore, the answers of GPT-3.5 were more consistent with human values.
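The bias analysis itself is a simple tally of predicted option letters against the roughly uniform gold distribution; a sketch with hypothetical predictions:

```python
from collections import Counter

def option_distribution(predictions):
    """Fraction of answers falling on each option letter."""
    counts = Counter(predictions)
    n = len(predictions)
    return {opt: counts.get(opt, 0) / n for opt in "ABCD"}

# Hypothetical predictions from a biased model: option C is over-produced.
preds = ["C"] * 180 + ["A"] * 45 + ["B"] * 40 + ["D"] * 36
print(option_distribution(preds))  # an unbiased model would be near 0.25 each
```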

3.3. Analysis of Advanced Prompting Methods

In Section 3.2, we utilized an empirical value of k = 3 as the number of shots for large-scale experiments. In this section, we explore the performance of different prompting methods for several representative LLMs with different numbers of shots.
Few-shot and CoT prompting: Figure 6 presents the results of direct few-shot and CoT few-shot prompting using different numbers of shots. The results show that increasing the number of examples improved performance only to a limited extent, whereas increasing the number of reasoning-chain exemplars had a more significant and stable effect on models of 70 B and above (LLaMA2-chat, GPT-4). However, for smaller models, these prompts exceeded their capabilities, resulting in degraded performance.
In-domain prompts vs. out-of-domain prompts: We used prompts from different domains for the two models capable of responding to CoT prompts. For the in-domain cases, we used domain-specific prompts for each type of question. As Figure 7 shows, the in-domain approach has significant advantages over the out-of-domain one, in which the examples were selected from the more general MMLU dataset [17].
Self-consistency: Although the answers to the SLI questions did not offer as many reasoning paths as mathematical reasoning questions, we found that self-consistency improved the performance of the GPT-3.5 model (as shown in Table 5). However, for LLaMA2-70B-chat, its occasional correct reasoning paths were outvoted by the many erroneous generations.
Augmented language models: On the Internet, individuals actively share their language learning experiences. By effectively using this external knowledge, LLMs can improve their credibility and mitigate hallucination issues. In this study, we examined the LLMs’ ability to use external knowledge to improve SLI by allowing GPT-3.5 to use auxiliary tools. Experiments in the zero-shot setting showed that GPT-3.5 called Google and Wikipedia searches 351 and 131 times, respectively, when answering the application questions. However, its performance did not improve compared to the case without external tools, with both achieving an accuracy of 49.1%. Encouragingly, the models using the tools could recognize their limitations and refuse to answer questions they were uncertain about, although this ability still seems relatively limited (see Table 6).

3.4. Experts’ Evaluations of the Multi-Turn Conversations

Compared to single-turn Q&A, people prefer interacting through dialogue interfaces. Here, we performed a more challenging evaluation in the context of language learning, specifically CAPT. From a private language learning dataset, we selected 20 English utterances produced by Chinese learners, with lengths between 5 and 15 words. The speakers had different levels of pronunciation proficiency and were aged between 15 and 30 years. We used the forced-alignment function provided by the Kaldi toolkit [60] to obtain the Goodness of Pronunciation (GOP) [61] score as the phoneme score and, subsequently, the word score by averaging the phoneme scores (see the sketch following the rating list below). The pronunciation realizations related to phonemes, fluency, break, stress, and intonation were annotated by human experts. The evaluation experiment was conducted as follows. First, we gave the LLMs a system prompt explaining the data structure of the CAPT results (see the initial system prompt example in Table 7). Then, descriptions of the real pronunciation realizations of a specific English utterance, annotated by human experts, were input to the LLMs as the current object of analysis. Next, a series of questions was posed, including “Please tell me whether the pronunciation is clear, accurate and natural”, “Please tell me how to pronounce it correctly”, “Please provide two examples to correct my mistake”, etc. This evaluation was designed to assess the ability of the model to analyze and reason about a given question, using knowledge of phonetics and second language acquisition, in a longer contextualized setting. The ratings of the model’s answers were divided into four groups:
  • RATING-A: The response was both valid and satisfactory and was relevant to the evaluation prompt.
  • RATING-B: This response was acceptable but contained minor errors or imperfections.
  • RATING-C: Although this response was relevant and addressed the instruction, it contained significant errors in its content.
  • RATING-D: This response was either irrelevant to the evaluation prompt or entirely invalid for the current topic.
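To make the scoring pipeline above concrete, the word-score step reduces to averaging the phoneme-level GOP scores within each word, as in the following sketch (the alignment structure shown is hypothetical; real Kaldi output is formatted differently):

```python
# Hypothetical forced-alignment output: each word maps to the GOP scores of
# its phonemes (higher = closer to native pronunciation in this sketch).
utterance = {
    "would": {"W": 0.92, "UH": 0.35, "D": 0.88},
    "you":   {"Y": 0.95, "UW": 0.90},
}

# Word score = mean of its phoneme scores.
word_scores = {
    word: sum(phones.values()) / len(phones)
    for word, phones in utterance.items()
}
print(word_scores)  # {'would': 0.716..., 'you': 0.925}
```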
Table 8 lists the results of the experts’ evaluations of the LLMs’ answers. GPT-3.5 achieved high performance and reliability. If A and B are considered acceptable responses, its accuracy reached 83.4%, which is nearly identical to its performance on SLIQ-LL. In contrast, LLaMA2-70B-chat did not generate as many satisfactory answers as GPT-3.5, and it only achieved acceptable performance in 54% of cases.

4. Discussion

Our work is the first to explore the potential applications of large language models (LLMs) in the field of spoken language learning. We aim to create a language learning AI tutor at a professional level that can accurately provide learning feedback. In the past, most work could only address errors at the phoneme level, such as substitution, insertion, or deletion errors. However, diagnostics at the suprasegmental level are highly customized, as a sentence can have different realizations of pronunciation, stress, break, and intonation depending on the context and intended meaning. CAPT systems equipped with SLI provided by LLMs can give spoken language learners more helpful instructions and feedback. Although this area is rarely explored, it holds the potential to fundamentally address the current shortcomings in intelligent language learning [2].
In this paper, we conducted comprehensive experiments to investigate the SLI of LLMs in language learning. The task of extracting conceptual knowledge posed few challenges for these LLMs. However, some models encountered difficulties when using this knowledge for inference. Some small models, owing to effective instruction fine-tuning, produced almost always valid responses. However, many small and even relatively large models performed worse than random guessing on the Application Questions subset, largely because they could not generate valid outputs following the instructions. Therefore, we conclude that the results we report are actually a comprehensive reflection of both SLI and the models’ ability to follow instructions.
Large language models can easily grasp the definitions of linguistic concepts owing to the availability of high-quality content on the Internet. For example, online platforms like Wikipedia clearly explain what a glottal stop is. However, it is highly unlikely that content on the Internet addresses questions like “How should I use a proper rhythm to articulate this sentence?” Such questions are highly customized and require the utilization of knowledge structures for analysis, specifically the ability to reason. By playing the role of speech/language experts, large language models have the potential to advance the development of human-computer interaction systems and educational technologies. However, many researchers believe that current artificial intelligence systems still lack the reasoning ability that is representative of human intelligence [62,63,64]. Despite the encouraging performance of LLMs in certain reasoning tasks, e.g., arithmetic reasoning and common-sense reasoning, current LLMs still perform poorly in most tasks that are relatively easy for humans, e.g., empathy, emotional support, and subjective decision making [20,65]. In spoken language learning, students need the help of proficient language teachers; accordingly, LLMs should possess SLI for CALL applications. The results of the current study suggest that the LLMs could handle concept-related questions but still lacked sufficiently powerful capabilities to answer our “expert-level” questions. In this respect, our findings are in line with the above-mentioned studies.
Some widely proven prompting methods, combined with domain-specific examples, resulted in significant performance improvements for the models (e.g., GPT-3.5: 49.1% → 64.4%; LLaMA2-70B-Chat: 42.2% → 48.6%; see Table 3 and Table 5 for details). However, these improvements were evident mostly in relatively larger models. Moreover, in most cases, appending more examples in the direct and CoT scenarios did not yield continuous performance improvements. Given the increasing number of consumed tokens, we regard 3∼5 shots as a reasonable choice.
As for using external tools, GPT-3.5 did not improve the reasoning accuracy in application questions after leveraging knowledge from Google and Wikipedia compared to zero-shot scenarios. This finding suggests that these questions cannot be easily resolved through Internet searches. A reliable alternative approach is to establish a dedicated knowledge repository as an additional source of information.
During our evaluation in dialogue mode, we observed that GPT-3.5 demonstrated a high level of usability. It maintained a strong focus on the subject and content of the conversation, showing minimal tendency to veer off-topic as the dialogue progressed. Its reasoning abilities remained consistent, comparable to its performance in single-turn tests. In contrast, LLaMA2-70B-chat often became perplexed by its own generated responses, digressing into self-referential narratives and forgetting to prioritize the user’s input. The interactive nature of dialogue-based language interaction continues to be a captivating approach, especially in the field of language learning, consistently attracting the interest of researchers and developers in the industry. Specifically, implementing highly effective oral interaction between language learners and LLM-powered CAPT systems will be in great demand [66].

5. Conclusions and Future Work

To evaluate the efficacy of LLMs in the realm of education, this study first introduced a dataset on spoken language intelligence, which can serve as a substantial benchmark for speech and language learning scenarios and provide valuable supplementation to existing benchmarks. Moreover, we explored zero-shot, few-shot, direct, and CoT prompts for answering phonology-related questions. These models all possessed strong conceptual knowledge and achieved high accuracy in simple zero- and few-shot learning scenarios. In practical question reasoning, the prompting methods attained notable performance improvements in comparison with the zero-shot baseline (GPT-3.5: 49.1% → 64.4%; LLaMA2-70B-Chat: 42.2% → 48.6%), and the strongest GPT-4 achieved 77.4% accuracy. The experimental results also demonstrated that in-domain example sampling techniques can consistently improve performance on domain-specific data. These performance results highlight the impressive SLI exhibited by LLMs. In addition, LLMs hold considerable promise for improving conversational spoken language learning. The findings of this study can shed light on the application of LLMs to spoken language learning and provide suggestions for other studies, e.g., in terms of model selection, prompting method, parameter setting, etc.
The present study only investigated the SLI of LLMs from the textual perspective. However, in future studies, we expect to place greater emphasis on the acoustic perspective (i.e., speech modality). For example, when directly presented with a speech segment from a conversation, it is important for LLMs to accurately identify which vowel is being spoken at any given moment, along with associated parameters such as the sampling rate, information about the correct pronunciation of specific phonemes (e.g., duration, pitch, and formants), or other relevant downstream information. To differentiate from ASR, prompts like “Is this a high or low vowel?” or “Which phoneme’s pronunciation should be improved and how?” could be utilized. Furthermore, in language learning settings, it is important to consider performance in multilingual scenarios. For example, Chinese English learners may prefer feedback presented in their mother tongue rather than English. Therefore, in future studies, it is worth paying attention to such factors and catering to the needs of learners to ensure effective language acquisition.

Author Contributions

Methodology, Y.G. and L.P.; writing—original draft preparation, Y.G., L.P. and B.N.; writing—review and editing, Y.G., L.P., B.N. and Y.L.; resources, B.N. and Y.L.; supervision, Y.G. and L.P.; project administration, Y.L.; funding acquisition, Y.G. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Key Project of the National Language Commission (Grant Number ZDI145-81), the Fundamental Research Funds for the Central Universities (Grant Number 2023RC13), and the National Natural Science Foundation of China (NSFC) (Grant Number 62271083).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The publicly available SLIQ-LL dataset used in this study can be found at https://github.com/vocaliodmiku/SLI-LL (accessed on 20 November 2023).

Conflicts of Interest

The author L.P. was employed by the company NetEase Youdao. This work is not an official product of NetEase Youdao. It can only be used for personal/research/non-commercial purposes. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Eskenazi, M. An overview of spoken language technology for education. Speech Commun. 2009, 51, 832–844. [Google Scholar] [CrossRef]
  2. Rogerson-Revell, P.M. Computer-assisted pronunciation training (CAPT): Current issues and future directions. Relc J. 2021, 52, 189–205. [Google Scholar] [CrossRef]
  3. Kang, O.; Kermad, A. Assessment in second language pronunciation. In The Routledge Handbook of Contemporary English Pronunciation; Routledge: Abingdon-on-Thames, UK, 2017; pp. 511–526. [Google Scholar]
  4. Kang, O.; Rubin, D.; Pickering, L. Suprasegmental measures of accentedness and judgments of language learner proficiency in oral English. Mod. Lang. J. 2010, 94, 554–566. [Google Scholar] [CrossRef]
  5. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  6. Doersch, C.; Gupta, A.; Efros, A.A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1422–1430. [Google Scholar]
  7. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
  8. Coope, S.; Farghly, T.; Gerz, D.; Vulić, I.; Henderson, M. Span-ConveRT: Few-shot Span Extraction for Dialog with Pretrained Conversational Representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Virtual, 5–10 July 2020; pp. 107–121. [Google Scholar]
  9. Min, B.; Ross, H.; Sulem, E.; Veyseh, A.P.B.; Nguyen, T.H.; Sainz, O.; Agirre, E.; Heintz, I.; Roth, D. Recent advances in natural language processing via large pre-trained language models: A survey. ACM Comput. Surv. 2023, 56, 1–40. [Google Scholar] [CrossRef]
  10. Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; de Las Casas, D.; Hendricks, L.A.; Welbl, J.; Clark, A.; et al. Training Compute-Optimal Large Language Models. arXiv 2022, arXiv:2203.15556. [Google Scholar]
  11. Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling laws for neural language models. arXiv 2020, arXiv:2001.08361. [Google Scholar]
  12. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
  13. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling Language Modeling with Pathways. arXiv 2022, arXiv:2204.02311. [Google Scholar]
  14. OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  15. Bubeck, S.; Chandrasekaran, V.; Eldan, R.; Gehrke, J.; Horvitz, E.; Kamar, E.; Lee, P.; Lee, Y.T.; Li, Y.; Lundberg, S.; et al. Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv 2023, arXiv:2303.12712. [Google Scholar]
  16. Bang, Y.; Cahyawijaya, S.; Lee, N.; Dai, W.; Su, D.; Wilie, B.; Lovenia, H.; Ji, Z.; Yu, T.; Chung, W.; et al. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity. arXiv 2023, arXiv:2302.04023. [Google Scholar]
  17. Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring massive multitask language understanding. arXiv 2020, arXiv:2009.03300. [Google Scholar]
  18. Castro Nascimento, C.M.; Pimentel, A.S. Do Large Language Models Understand Chemistry? A Conversation with ChatGPT. J. Chem. Inf. Model. 2023, 63, 1649–1655. [Google Scholar] [CrossRef]
  19. Frank, M.C. Baby steps in evaluating the capacities of large language models. Nat. Rev. Psychol. 2023, 2, 451–452. [Google Scholar] [CrossRef]
  20. Valmeekam, K.; Olmo, A.; Sreedharan, S.; Kambhampati, S. Large language models still can’t plan (A benchmark for LLMs on planning and reasoning about change). In Proceedings of the NeurIPS 2022 Foundation Models for Decision Making Workshop, New Orleans, LA, USA, 2022. [Google Scholar]
  21. Liévin, V.; Hother, C.E.; Winther, O. Can large language models reason about medical questions? arXiv 2022, arXiv:2207.08143. [Google Scholar]
  22. Dai, W.; Lin, J.; Jin, H.; Li, T.; Tsai, Y.S.; Gašević, D.; Chen, G. Can large language models provide feedback to students? A case study on ChatGPT. In Proceedings of the 2023 IEEE International Conference on Advanced Learning Technologies (ICALT), Orem, UT, USA, 10–13 July 2023; pp. 323–325. [Google Scholar]
  23. Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443. [Google Scholar] [CrossRef]
  24. Xu, P.; Zhu, X.; Clifton, D.A. Multimodal learning with transformers: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12113–12132. [Google Scholar] [CrossRef]
  25. Hossain, M.Z.; Sohel, F.; Shiratuddin, M.F.; Laga, H. A comprehensive survey of deep learning for image captioning. ACM Comput. Surv. (CsUR) 2019, 51, 1–36. [Google Scholar] [CrossRef]
  26. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  27. Ling, S.; Hu, Y.; Qian, S.; Ye, G.; Qian, Y.; Gong, Y.; Lin, E.; Zeng, M. Adapting Large Language Model with Speech for Fully Formatted End-to-End Speech Recognition. arXiv 2023, arXiv:2307.08234. [Google Scholar]
  28. Sigurgeirsson, A.T.; King, S. Using a Large Language Model to Control Speaking Style for Expressive TTS. arXiv 2023, arXiv:2305.10321. [Google Scholar]
  29. Rubenstein, P.K.; Asawaroengchai, C.; Nguyen, D.D.; Bapna, A.; Borsos, Z.; de Chaumont Quitry, F.; Chen, P.; Badawy, D.E.; Han, W.; Kharitonov, E.; et al. AudioPaLM: A Large Language Model That Can Speak and Listen. arXiv 2023, arXiv:2306.12925. [Google Scholar]
  30. Borsos, Z.; Marinier, R.; Vincent, D.; Kharitonov, E.; Pietquin, O.; Sharifi, M.; Roblek, D.; Teboul, O.; Grangier, D.; Tagliasacchi, M.; et al. AudioLM: A Language Modeling Approach to Audio Generation. arXiv 2023, arXiv:2209.03143. [Google Scholar] [CrossRef]
  31. Anil, R.; Dai, A.M.; Firat, O.; Johnson, M.; Lepikhin, D.; Passos, A.; Shakeri, S.; Taropa, E.; Bailey, P.; Chen, Z.; et al. PaLM 2 Technical Report. arXiv 2023, arXiv:2305.10403. [Google Scholar]
  32. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2023, arXiv:2201.11903. [Google Scholar]
  33. Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv 2022, arXiv:2203.11171. [Google Scholar]
  34. Nakano, R.; Hilton, J.; Balaji, S.; Wu, J.; Ouyang, L.; Kim, C.; Hesse, C.; Jain, S.; Kosaraju, V.; Saunders, W.; et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv 2022, arXiv:2112.09332. [Google Scholar]
  35. Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language models can teach themselves to use tools. arXiv 2023, arXiv:2302.04761. [Google Scholar]
  36. Gao, L.; Madaan, A.; Zhou, S.; Alon, U.; Liu, P.; Yang, Y.; Callan, J.; Neubig, G. Pal: Program-aided language models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 10764–10799. [Google Scholar]
  37. Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, New York, NY, USA, 3–10 March 2021; pp. 610–623. [Google Scholar] [CrossRef]
  38. Kang, O.; Rubin, D.L. Reverse linguistic stereotyping: Measuring the effect of listener expectations on speech evaluation. J. Lang. Soc. Psychol. 2009, 28, 441–456. [Google Scholar] [CrossRef]
  39. Kasneci, E.; Sessler, K.; Küchemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; Günnemann, S.; Hüllermeier, E.; et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learn. Individ. Differ. 2023, 103, 102274. [Google Scholar] [CrossRef]
  40. Kung, T.H.; Cheatham, M.; Medenilla, A.; Sillos, C.; De Leon, L.; Elepaño, C.; Madriaga, M.; Aggabao, R.; Diaz-Candido, G.; Maningo, J.; et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS Digit. Health 2023, 2, e0000198. [Google Scholar] [CrossRef] [PubMed]
  41. Srivastava, A.; Rastogi, A.; Rao, A.; Shoeb, A.A.M.; Abid, A.; Fisch, A.; Brown, A.R.; Santoro, A.; Gupta, A.; Garriga-Alonso, A.; et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. arXiv 2022, arXiv:2206.04615. [Google Scholar]
  42. Ladefoged, P.; Johnson, K. A Course in Phonetics; Cengage Learning: Boston, MA, USA, 2014. [Google Scholar]
  43. Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large language models are zero-shot reasoners. Adv. Neural Inf. Process. Syst. 2022, 35, 22199–22213. [Google Scholar]
  44. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language Models. arXiv 2023, arXiv:2303.18223. [Google Scholar]
  45. Lai, G.; Xie, Q.; Liu, H.; Yang, Y.; Hovy, E. Race: Large-scale reading comprehension dataset from examinations. arXiv 2017, arXiv:1704.04683. [Google Scholar]
  46. Lin, S.; Hilton, J.; Evans, O. Truthfulqa: Measuring how models mimic human falsehoods. arXiv 2021, arXiv:2109.07958. [Google Scholar]
  47. Robinson, J.; Rytting, C.M.; Wingate, D. Leveraging large language models for multiple choice question answering. arXiv 2022, arXiv:2210.12353. [Google Scholar]
  48. Imani, S.; Du, L.; Shrivastava, H. Mathprompter: Mathematical reasoning using large language models. arXiv 2023, arXiv:2303.05398. [Google Scholar]
  49. Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Zhou, D. Rationale-augmented ensembles in language models. arXiv 2022, arXiv:2207.00747. [Google Scholar]
  50. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; tau Yih, W.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv 2021, arXiv:2005.11401. [Google Scholar]
  51. Borgeaud, S.; Mensch, A.; Hoffmann, J.; Cai, T.; Rutherford, E.; Millican, K.; van den Driessche, G.; Lespiau, J.B.; Damoc, B.; Clark, A.; et al. Improving language models by retrieving from trillions of tokens. arXiv 2022, arXiv:2112.04426. [Google Scholar]
  52. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
  53. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
  54. Tay, Y.; Dehghani, M.; Tran, V.Q.; Garcia, X.; Bahri, D.; Schuster, T.; Zheng, H.S.; Houlsby, N.; Metzler, D. Unifying language learning paradigms. arXiv 2022, arXiv:2205.05131. [Google Scholar]
  55. Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; et al. Scaling Instruction-Finetuned Language Models. arXiv 2022, arXiv:2210.11416. [Google Scholar]
  56. Biderman, S.; Schoelkopf, H.; Anthony, Q.; Bradley, H.; O’Brien, K.; Hallahan, E.; Khan, M.A.; Purohit, S.; Prashanth, U.S.; Raff, E.; et al. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling. arXiv 2023, arXiv:2304.01373. [Google Scholar]
  57. Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.P.; et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv 2023, arXiv:2306.05685. [Google Scholar]
  58. Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; Hashimoto, T.B. Stanford Alpaca: An Instruction-Following LLaMA Model. Available online: https://github.com/tatsu-lab/stanford_alpaca (accessed on 11 August 2023).
  59. Fu, Y.; Ou, L.; Chen, M.; Wan, Y.; Peng, H.; Khot, T. Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models’ Reasoning Performance. arXiv 2023, arXiv:2305.17306. [Google Scholar]
  60. Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlíček, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi speech recognition toolkit. In Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Waikoloa, HI, USA, 11–15 December 2011. [Google Scholar]
  61. Witt, S.; Young, S. Computer-assisted pronunciation teaching based on automatic speech recognition. In Language Teaching and Language Technology; Routledge: Abingdon-on-Thames, UK, 2014; pp. 25–35. [Google Scholar]
  62. Marcus, G. The next decade in AI: Four steps towards robust artificial intelligence. arXiv 2020, arXiv:2002.06177. [Google Scholar]
  63. Russin, J.; O’Reilly, R.C.; Bengio, Y. Deep learning needs a prefrontal cortex. Work. Bridg. AI Cogn. Sci. 2020, 107, 1. [Google Scholar]
  64. Mitchell, M. Abstraction and analogy-making in artificial intelligence. Ann. N. Y. Acad. Sci. 2021, 1505, 79–101. [Google Scholar] [CrossRef] [PubMed]
  65. Huang, J.; Chang, K.C.C. Towards reasoning in large language models: A survey. arXiv 2022, arXiv:2212.10403. [Google Scholar]
  66. Lin, V.; Yeh, H.C.; Chen, N.S. A systematic review on oral interactions in robot-assisted language learning. Electronics 2022, 11, 290. [Google Scholar] [CrossRef]
Figure 1. An example of answering an SLIQ-LL application question using zero-shot CoT prompting, “Let’s think step by step” [43].
Figure 2. The distribution of different problem types and their answer options in the Application Questions subset. (Left) Distribution of problem types. We labeled each question based on the type of problem and all the corresponding options. (Right) Distribution of the four answer options. “Overall” refers to the entire dataset.
Figure 3. Overall performance on the SLIQ-LL dataset. We report the best results of each model among four different prompting methods. The results in the figure are the raw results minus 25% (random selection level).
Figure 4. The distribution of performance on the two subsets across different model sizes.
Figure 5. The accuracy distribution of all tested LLMs for different question types in the Application Questions subset.
Figure 6. Number of correctly answered questions (out of 301) on SLIQ-LL (Application Questions subset) using k = 1∼9 shots. The development examples used in the previous experiments were manually written, whereas the current ones were sampled from the dataset. The thought chain was generated by GPT-4 along with correctly generated answers (the thought chain itself is not guaranteed to be correct).
Figure 7. Few-shot CoT performance of in-domain (ID) and out-of-domain (OOD) prompts. For ID prompts, we customized examples related to the question type, whereas for OOD prompts, we randomly picked one OOD sample from 20 disciplines of the MMLU dataset with CoT prompts [59].
Table 1. Prompt templates. Square brackets (e.g., [provided data]) represent data provided to the model, whereas angle brackets (e.g., <completions>) represent the parts the model needs to generate. The symbol ∅ represents an empty string (no exemplar). Self-consistency is not included in the table.

Task description (system): You are an expert in phonetics, English phonology, and second language acquisition. Here is a multiple-choice question for you to answer correctly.

Zero-shot (shot = ∅):
  Question: [Question]
  Answer: <answer>

Few-shot:
  Question: [Question]
  Answer: [answer]
  Question: [Question]
  Answer: <answer>

Zero-shot CoT (shot = ∅):
  Question: [Question]
  Answer: Let's think step by step. <CoT>
  Therefore, the answer is <answer>

Few-shot CoT:
  Question: [Question]
  Answer: Let's think step by step. [CoT]
  Therefore, the answer is [answer]
  Question: [Question]
  Answer: Let's think step by step. <CoT>
  Therefore, the answer is <answer>

In-domain exemplar:
  Question: [Question about T]
  Answer: Let's think step by step. [CoT]
  Therefore, the answer is [answer]
  Question: [Question about T]
  Answer: Let's think step by step. <CoT>
  Therefore, the answer is <answer>

Tool augmentation:
  Question: [Question]
  Thought: <CoT> + I need to use some tools to find the answer.
  Action: {"action": "<tool>", "input": "<input>"}
  Observation: The <tool> contains detailed information about the <input>.
  Thought: <CoT>
  Therefore, the answer is <answer>

T ∈ {Pronunciation, Stress, Break, Intonation}.
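To make these templates concrete, here is a minimal Python sketch of how they can be assembled programmatically. The helper name `build_prompt` and the exemplar contents are illustrative assumptions, not part of the released dataset or the authors' code.

```python
# A minimal sketch of assembling the Table 1 prompt templates.
# The exemplar below is a made-up placeholder, not a SLIQ-LL item.

SYSTEM = ("You are an expert in phonetics, English phonology, and second "
          "language acquisition. Here is a multiple-choice question for you "
          "to answer correctly.")

EXEMPLARS = [
    {"question": "Which word carries sentence stress in 'I SAID no'? ...",
     "cot": "Capitalization marks emphasis, so 'SAID' is stressed ...",
     "answer": "B"},
]

def build_prompt(question: str, n_shots: int = 0, cot: bool = False) -> str:
    """Compose a zero-/few-shot prompt, optionally with CoT, per Table 1."""
    parts = [SYSTEM]
    for ex in EXEMPLARS[:n_shots]:  # the "Shot" block; empty when n_shots = 0
        if cot:
            parts.append(f"Question: {ex['question']}\n"
                         f"Answer: Let's think step by step. {ex['cot']}\n"
                         f"Therefore, the answer is {ex['answer']}")
        else:
            parts.append(f"Question: {ex['question']}\nAnswer: {ex['answer']}")
    parts.append(f"Question: {question}")
    # The model is left to complete <CoT> and/or <answer>.
    parts.append("Answer: Let's think step by step." if cot else "Answer:")
    return "\n\n".join(parts)

print(build_prompt("Where does the primary stress fall in 'photography'?",
                   n_shots=1, cot=True))
```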
Table 2. Statistics of the LLMs used in this work. “Adaptation” refers to whether the model has been subsequently fine-tuned: IT denotes instruction tuning and RLHF denotes reinforcement learning from human feedback. “Evaluation” indicates whether the model was evaluated in terms of ICL and CoT abilities in its original paper. GPT-3.5 is an upgraded version of GPT-3 with RLHF, and GPT-3.5-turbo, the interface used to invoke ChatGPT, belongs to the GPT-3.5 series.

| Model | Release Time | Size (B) | Base Model | IT | RLHF | Pre-Train Data Scale | Hardware (GPUs/TPUs) | Training/FT Time | ICL | CoT |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA [52] | February 2023 | 7∼65 | – | – | – | 1.4 T tokens | 2048 80 G A100 | 21 d | ✓ | – |
| Vicuna [57] | March 2023 | 7∼33 | LLaMA | ✓ | – | – | 8 80 G A100 | – | ✓ | – |
| Alpaca [58] | March 2023 | 7 | LLaMA | ✓ | – | – | 4 80 G A100 | 3 h | – | – |
| Flan-T5 [55] | October 2022 | 11 (XXL) | T5 | ✓ | – | – | – | – | ✓ | ✓ |
| Flan-UL2 [54] | March 2023 | 20 | UL2 | ✓ | – | – | – | – | ✓ | ✓ |
| Pythia [56] | April 2023 | 12 | – | – | – | 300 B tokens | 256 40 G A100 | – | ✓ | – |
| LLaMA2 [52] | July 2023 | 7∼70 | – | ✓ | ✓ | 2.0 T tokens | 2000 80 G A100 | – | ✓ | ✓ |
| GPT-3 [12] | May 2020 | 175 | – | – | – | 300 B tokens | – | – | ✓ | – |
| GPT-4 [14] | March 2023 | – | – | ✓ | ✓ | – | – | – | ✓ | ✓ |

This table is mainly adapted from [44].
Table 3. The SLI accuracy of various LLMs in the first round of experiments. Δ denotes the change, in percentage points, relative to the same model's 0-shot Direct baseline.

| Model | Prompt | Shot | Concept (144) | Δ | Applied Questions (301) | Δ | Overall |
|---|---|---|---|---|---|---|---|
| LLaMA1-7B | Direct | 0 | 38.9% (56) | | 0.0% (0) | | 12.6% |
| LLaMA1-7B | Direct | 3 | 14.6% (21) | −24.3% | 4.0% (12) | +4.0% | 7.4% |
| LLaMA1-7B | CoT | 0 | 4.9% (7) | −34.0% | 4.3% (13) | +4.3% | 4.5% |
| LLaMA1-7B | CoT | 3 | 40.3% (58) | +1.4% | 31.2% (94) | +31.2% | 34.2% |
| Alpaca-7B | Direct | 0 | 50.7% (73) | | 27.6% (83) | | 35.1% |
| Alpaca-7B | Direct | 3 | 41.0% (59) | −9.7% | 28.6% (86) | +1.0% | 32.6% |
| Alpaca-7B | CoT | 0 | 13.9% (20) | −36.8% | 7.6% (23) | −20.0% | 9.6% |
| Alpaca-7B | CoT | 3 | 39.6% (57) | −11.1% | 33.2% (100) | +5.6% | 35.3% |
| Vicuna-7B | Direct | 0 | 50.0% (72) | | 18.6% (56) | | 28.8% |
| Vicuna-7B | Direct | 3 | 0.7% (1) | −49.3% | 1.0% (3) | −17.6% | 0.9% |
| Vicuna-7B | CoT | 0 | 57.6% (83) | +7.6% | 22.3% (67) | +3.7% | 33.7% |
| Vicuna-7B | CoT | 3 | 60.4% (87) | +10.4% | 33.2% (100) | +14.6% | 42.0% |
| LLaMA2-7B | Direct | 0 | 38.2% (55) | | 15.3% (46) | | 22.7% |
| LLaMA2-7B | Direct | 3 | 45.1% (65) | +6.9% | 13.6% (41) | −1.7% | 23.8% |
| LLaMA2-7B | CoT | 0 | 11.1% (16) | −27.1% | 4.0% (13) | −11.3% | 6.5% |
| LLaMA2-7B | CoT | 3 | 68.1% (98) | +29.9% | 36.2% (109) | +20.9% | 46.5% |
| LLaMA2-7B-Chat | Direct | 0 | 72.2% (104) | | 31.2% (94) | | 44.5% |
| LLaMA2-7B-Chat | Direct | 3 | 70.8% (102) | −1.4% | 31.0% (93) | −0.2% | 43.8% |
| LLaMA2-7B-Chat | CoT | 0 | 86.1% (124) | +13.9% | 28.6% (86) | −2.6% | 47.2% |
| LLaMA2-7B-Chat | CoT | 3 | 77.1% (111) | +4.9% | 33.2% (100) | +2.0% | 47.4% |
| Pythia-7B | Direct | 0 | 2.8% (4) | | 18.9% (57) | | 13.7% |
| Pythia-7B | Direct | 3 | 15.3% (22) | +12.5% | 15.3% (46) | −3.6% | 15.3% |
| Pythia-7B | CoT | 0 | 8.3% (12) | +5.5% | 12.0% (36) | −6.9% | 10.8% |
| Pythia-7B | CoT | 3 | 14.6% (21) | +11.8% | 26.9% (81) | +8.0% | 22.9% |
| Flan-T5-XXL-11B | Direct | 0 | 88.2% (127) | | 46.2% (139) | | 59.8% |
| Flan-T5-XXL-11B | Direct | 3 | 88.9% (128) | +0.7% | 47.8% (144) | +1.6% | 61.1% |
| Flan-T5-XXL-11B | CoT | 0 | 79.2% (114) | −9.0% | 37.9% (114) | −8.3% | 51.2% |
| Flan-T5-XXL-11B | CoT | 3 | 86.1% (124) | −2.1% | 40.5% (122) | −5.7% | 55.3% |
| Pythia-12B | Direct | 0 | 20.1% (29) | | 24.9% (75) | | 23.4% |
| Pythia-12B | Direct | 3 | 20.1% (29) | +0.0% | 26.6% (80) | +1.7% | 24.5% |
| Pythia-12B | CoT | 0 | 18.1% (26) | −2.0% | 19.3% (58) | −5.6% | 18.9% |
| Pythia-12B | CoT | 3 | 22.2% (32) | +2.1% | 22.3% (67) | −2.6% | 22.2% |
| LLaMA1-13B | Direct | 0 | 33.3% (48) | | 7.6% (23) | | 15.9% |
| LLaMA1-13B | Direct | 3 | 66.7% (96) | +33.4% | 26.2% (79) | +18.6% | 39.3% |
| LLaMA1-13B | CoT | 0 | 25.0% (36) | −8.3% | 11.3% (34) | +3.7% | 15.7% |
| LLaMA1-13B | CoT | 3 | 65.3% (94) | +32.0% | 33.6% (101) | +26.0% | 43.8% |
| LLaMA2-13B | Direct | 0 | 72.2% (104) | | 30.2% (91) | | 43.8% |
| LLaMA2-13B | Direct | 3 | 81.3% (117) | +9.1% | 16.3% (49) | −13.9% | 37.3% |
| LLaMA2-13B | CoT | 0 | 25.7% (37) | −46.5% | 10.6% (32) | −19.6% | 15.5% |
| LLaMA2-13B | CoT | 3 | 83.3% (120) | +11.1% | 42.2% (127) | +12.0% | 55.5% |
| LLaMA2-13B-Chat | Direct | 0 | 88.2% (127) | | 35.2% (106) | | 52.4% |
| LLaMA2-13B-Chat | Direct | 3 | 74.3% (107) | −13.9% | 36.2% (109) | +1.0% | 48.5% |
| LLaMA2-13B-Chat | CoT | 0 | 84.7% (122) | −3.5% | 38.9% (117) | +3.7% | 53.7% |
| LLaMA2-13B-Chat | CoT | 3 | 85.4% (123) | −2.8% | 40.2% (121) | +5.0% | 54.8% |
| Vicuna-13B | Direct | 0 | 63.2% (91) | | 19.9% (60) | | 33.9% |
| Vicuna-13B | Direct | 3 | 6.9% (10) | −56.3% | 10.6% (32) | −9.3% | 9.4% |
| Vicuna-13B | CoT | 0 | 72.9% (105) | +9.7% | 30.6% (92) | +10.7% | 44.3% |
| Vicuna-13B | CoT | 3 | 83.3% (120) | +20.1% | 38.9% (117) | +19.0% | 53.3% |
| Flan-UL2-20B | Direct | 0 | 87.5% (126) | | 41.9% (126) | | 56.6% |
| Flan-UL2-20B | Direct | 3 | 88.9% (128) | +1.4% | 44.5% (134) | +2.6% | 58.9% |
| Flan-UL2-20B | CoT | 0 | 86.8% (125) | −0.7% | 38.9% (117) | −3.0% | 54.4% |
| Flan-UL2-20B | CoT | 3 | 88.9% (128) | +1.4% | 38.5% (116) | −3.4% | 54.8% |
| LLaMA1-30B | Direct | 0 | 78.5% (113) | | 0.7% (2) | | 25.8% |
| LLaMA1-30B | Direct | 3 | 86.8% (125) | +8.3% | 38.5% (116) | +37.8% | 54.2% |
| LLaMA1-30B | CoT | 0 | 25.7% (37) | −52.8% | 13.6% (41) | +12.9% | 17.5% |
| LLaMA1-30B | CoT | 3 | 82.0% (118) | +3.5% | 41.2% (124) | +40.5% | 54.4% |
| Vicuna-33B | Direct | 0 | 71.5% (103) | | 36.2% (109) | | 47.6% |
| Vicuna-33B | Direct | 3 | 81.9% (118) | +10.4% | 37.9% (114) | +1.7% | 52.1% |
| Vicuna-33B | CoT | 0 | 74.3% (107) | +2.8% | 36.5% (110) | +0.3% | 48.8% |
| Vicuna-33B | CoT | 3 | 77.8% (112) | +6.3% | 42.5% (128) | +6.3% | 53.9% |
| LLaMA1-65B | Direct | 0 | 0.0% (0) | | 17.3% (52) | | 11.7% |
| LLaMA1-65B | Direct | 3 | 86.1% (124) | +86.1% | 42.2% (127) | +24.9% | 56.4% |
| LLaMA1-65B | CoT | 0 | 23.6% (34) | +23.6% | 9.0% (27) | −8.3% | 13.7% |
| LLaMA1-65B | CoT | 3 | 86.8% (125) | +86.8% | 47.8% (144) | +30.5% | 60.4% |
| LLaMA2-70B | Direct | 0 | 84.7% (122) | | 5.0% (15) | | 30.8% |
| LLaMA2-70B | Direct | 3 | 91.6% (132) | +6.9% | 50.8% (153) | +45.8% | 64.0% |
| LLaMA2-70B | CoT | 0 | 36.8% (53) | −47.9% | 12.6% (38) | +7.6% | 20.4% |
| LLaMA2-70B | CoT | 3 | 85.4% (123) | +0.7% | 50.2% (151) | +45.2% | 61.6% |
| LLaMA2-70B-Chat | Direct | 0 | 86.8% (125) | | 42.2% (127) | | 56.6% |
| LLaMA2-70B-Chat | Direct | 3 | 88.9% (128) | +2.1% | 42.2% (127) | +0.0% | 57.3% |
| LLaMA2-70B-Chat | CoT | 0 | 88.2% (127) | +1.4% | 44.2% (133) | +2.0% | 58.4% |
| LLaMA2-70B-Chat | CoT | 3 | 88.2% (127) | +1.4% | 48.5% (146) | +6.3% | 61.3% |
| GPT-3.5-turbo | Direct | 0 | 93.0% (134) | | 49.1% (148) | | 63.4% |
| GPT-3.5-turbo | Direct | 3 | 95.8% (138) | +2.8% | 53.5% (161) | +4.4% | 67.2% |
| GPT-3.5-turbo | CoT | 0 | 85.4% (123) | −7.6% | 54.2% (163) | +5.1% | 64.3% |
| GPT-3.5-turbo | CoT | 3 | 91.7% (132) | −1.3% | 56.8% (171) | +7.7% | 68.1% |
| GPT-4 | Direct | 0 | 96.5% (139) | | 73.4% (221) | | 80.9% |
| GPT-4 | Direct | 3 | 97.2% (140) | +0.7% | 73.1% (220) | −0.3% | 80.9% |
| GPT-4 | CoT | 0 | 96.5% (139) | +0.0% | 77.4% (233) | +4.0% | 83.6% |
| GPT-4 | CoT | 3 | 97.2% (140) | +0.7% | 77.4% (233) | +4.0% | 83.8% |
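Scoring a table like this reduces to mapping each completion to an option letter and counting matches. The sketch below is a hypothetical post-processing routine (the regular expression is our assumption, not the paper's exact extraction rule); the Δ columns are then plain differences, in percentage points, from the same model's 0-shot Direct row.

```python
import re

def extract_choice(completion: str) -> str | None:
    """Pull the chosen option letter (A-D) out of a model completion."""
    m = re.search(r"answer is\s*\(?([A-D])\)?", completion, re.IGNORECASE)
    return m.group(1).upper() if m else None

def accuracy(predictions: list[str | None], labels: list[str]) -> float:
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Delta relative to the 0-shot Direct baseline, as reported in Table 3:
# delta = 100 * (accuracy(method) - accuracy(direct_0_shot))
```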
Table 4. Frequency of predictions and labels. Underestimated predictions are marked with ▼, whereas overestimated predictions are marked with ▲ (beyond ±20% of the label frequency). The last column gives the p-value of the χ² test for the null hypothesis that the answer distribution produced by the LLM equals that of the ground truth.

| Model | A | B | C | D | Acc. | p-Value |
|---|---|---|---|---|---|---|
| GPT-4 | 77 | 84 | 69 | 71 | 74.6% | 5 × 10⁻¹ |
| GPT-3.5-turbo | 76 | 91 | 70 | 64 | 53.1% | 2 × 10⁻¹ |
| LLaMA2-70B-Chat | 64 | 87 | 100 ▲ | 50 ▼ | 44.3% | 2 × 10⁻³ |
| Flan-UL2-20B | 93 ▲ | 56 ▼ | 70 | 82 | 41.6% | 4 × 10⁻³ |
| Flan-T5-XXL-11B | 93 ▲ | 67 | 80 | 61 | 44.8% | 6 × 10⁻² |
| No. of labels | 74 | 76 | 80 | 71 | | |

Averaged over {0, 3, 5, 8, 10, 12, 15, 20, 25, 30}-shot and 1∼9-shot CoT results. Part of the experimental data (i.e., CoT) comes from Section 3.3.
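The p-values in Table 4 can be reproduced from the listed counts with a one-sample χ² goodness-of-fit test. Below is a sketch using SciPy (our tooling choice; the paper does not state which implementation was used):

```python
from scipy.stats import chisquare

# Observed answer counts for LLaMA2-70B-Chat and the ground-truth label
# counts from Table 4; both sum to the 301 Application Questions.
observed = [64, 87, 100, 50]   # predicted A/B/C/D
expected = [74, 76, 80, 71]    # label frequencies

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.2f}, p = {p:.1e}")  # chi2 = 14.15, p ~ 2.7e-03
```

Running the same test on the GPT-4 row ([77, 84, 69, 71]) gives p ≈ 0.5, matching the table: GPT-4's answer distribution is statistically indistinguishable from the label distribution, whereas LLaMA2-70B-Chat's is not.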
Table 5. Comparison of self-consistency with CoT prompts between the GPT-3.5 and LLaMA2-70B-Chat models. Self-consistency was derived using the top 5 few-shot CoT results.

| Model | CoT | Self-Consistency |
|---|---|---|
| GPT-3.5 | 60.1 ± 1.5 | 64.4 |
| LLaMA2-70B-Chat | 48.6 ± 1.2 | 48.2 |
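Self-consistency amounts to sampling several CoT completions for the same question and taking a majority vote over the extracted answer letters. A minimal sketch (the sampling of the CoT outputs themselves is omitted):

```python
from collections import Counter

def self_consistent_answer(sampled_answers: list[str]) -> str:
    """Majority vote over answers extracted from sampled CoT completions."""
    # Ties resolve to the earliest-seen answer; a production system might
    # instead fall back to the highest-probability completion.
    return Counter(sampled_answers).most_common(1)[0][0]

print(self_consistent_answer(["B", "B", "C", "B", "D"]))  # -> B
```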
Table 6. The performance of tool-augmented GPT-3.5, using Google and Wikipedia. Explicit Reject represents the model's explicit rejection of a question, such as “Based on the available information, the answer cannot be determined”. True Reject means that the rejection avoided an incorrect answer that zero-shot generation would have produced.

| Method | Acc. | Explicit Reject | True Reject |
|---|---|---|---|
| Zero-shot | 49.1 | 1 | – |
| Tools Aug. | 49.1 | 14 | 6 |
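The tool-augmented runs follow the Thought/Action/Observation loop from Table 1: the model emits a JSON action, the runtime executes the named search tool, and the observation is fed back into the prompt. A minimal sketch, assuming two hypothetical search helpers (the actual experiments used Google and Wikipedia):

```python
import json
import re

def search_google(query: str) -> str: ...     # stub: returns search snippets
def search_wikipedia(query: str) -> str: ...  # stub: returns article extract

TOOLS = {"google": search_google, "wikipedia": search_wikipedia}

def run_tool_step(model_output: str) -> str:
    """Parse an Action such as {"action": "wikipedia", "input": "liaison"}."""
    m = re.search(r"\{.*\}", model_output, re.DOTALL)
    if m is None:
        return ""  # no tool call; the model answered directly
    call = json.loads(m.group(0))
    return f'Observation: {TOOLS[call["action"]](call["input"])}'
```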
Table 7. An example of an initial system prompt for a multi-turn conversation.

System Prompt
You are an expert in phonetics, English phonology, and second language acquisition. You will play the role of an English teacher who is helping me practice American English pronunciation and speaking skills. You will see a spoken speech evaluation result, where context provides the context for this pronunciation, canonical is the text of this evaluation and represents the expected pronunciation of the speaker, soundlike is how the user’s pronunciation actually sounds, and SentenceScore is the sentence pronunciation score, with a higher score indicating better pronunciation. Fluency is the sentence fluency score, with a higher score indicating better fluency. Speed is the speech rate, which is the average number of milliseconds per phoneme, and emotion is the emotion. In WordScores, each word score is shown in parentheses, and PhonesScores contains the phoneme pronunciation score for each word and what these phonemes actually sound like. Liaison represents the connected sounds between two words, marked with a [∼] symbol. Break represents a break between two words that is greater than 200 ms, marked with a [pause] symbol. Stress represents the emphasized syllable or word in a sentence, marked with an asterisk symbol to indicate a word that is emphasized more than the others being compared. Intonation indicates whether the sentence’s intonation is rising, falling, or flat.
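In use, a prompt like Table 7 seeds a multi-turn chat: the system message carries the teacher persona, and each CAPT evaluation result is sent as a user turn. The sketch below assumes an OpenAI-style chat-completions client; the serialized evaluation fields are abbreviated, made-up placeholders:

```python
from openai import OpenAI  # assumes the OpenAI Python client (v1 interface)

client = OpenAI()
SYSTEM_PROMPT = "You are an expert in phonetics, English phonology, ..."  # full text in Table 7

# Hypothetical CAPT result serialized for the model (fields per Table 7).
user_turn = ('context: "ordering coffee", canonical: "I would like a latte", '
             'soundlike: "I would like a late", SentenceScore: 72, Fluency: 80, ...')

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": user_turn},
]
reply = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
messages.append({"role": "assistant",
                 "content": reply.choices[0].message.content})
# Later learner turns are appended as further {"role": "user"} messages.
```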
Table 8. Distribution (%) of LLMs’ answers among multi-turn conversations for 20 CAPT test samples. Their SLI was rated by human experts in terms of satisfaction, validity, and relevance to the prompts.

| Model | A | B | C | D |
|---|---|---|---|---|
| GPT-3.5 | 55.6 | 27.8 | 16.7 | 0.0 |
| LLaMA2-70B-Chat | 35.1 | 18.9 | 43.2 | 2.7 |