1. Introduction
Effective communication is a key attribute for university graduates. For many higher education institutions (HEIs), “students’ capabilities to… read, write, listen, and speak effectively are all now in the spotlight and are an institutional concern” [1] (p. 55). Language training often focuses on developing students’ productive language skills, including speaking, which are especially crucial for HEIs that employ English as the medium of instruction (EMI) and require oral presentations in English as a common form of assessment. Most courses within a student’s major include at least one oral presentation assessment, such as an individual/group PowerPoint or poster presentation. However, it is not feasible for these courses to cover all the potential speaking genres that undergraduate students from different disciplines will encounter during their academic pursuits. It is, therefore, necessary to think outside the box, beyond courses and lessons, to explore other means of helping students understand and meet the requirements of different academic speaking tasks, as well as providing them with practice opportunities and useful feedback for improvement.
This paper reports on a study that attempted to incorporate readily available AI tools into a one-stop platform on which HEI students could access self-directed learning materials and automated feedback to enhance their presentation skills. It lays out the journey of an interdisciplinary team of academics to identify the right AI tools for building a self-directed e-learning platform that can engage learners, offer actionable feedback, and be deployed economically and swiftly enough to support unsupervised learning. The applicability, reliability, limitations, and further development opportunities of these AI tools were evaluated based on a beta test with 24 students and a comparison of AI and human scoring of the presentation performance of 36 students.
3. Methodology
3.1. Needs and Obstacles in the Present Research Context
The pandemic has disrupted both language learning activities and high-stakes testing. Students, regardless of academic discipline, require good English presentation skills. In the two HEIs in which this study was conducted, many of the students admitted to undergraduate degree programs had English levels equivalent to International English Language Testing System (IELTS) scores of 5.5 to 6. The HEIs employ EMI, which means all oral presentation assignments are delivered completely in English. Due to the large number of students in many courses, grading students’ oral presentations takes considerable time, which means teachers may only be able to focus on content, as opposed to communication skills, when grading presentation assignments.
Although teachers may not have the time to focus on the communication aspects of oral presentations, such as language use, delivery, and pronunciation, they want to offer students pre-assessment training opportunities. Technological advancement has facilitated the diagnosis of speech problems, and the integration of AI technologies in foreign language education can support flexible, interactive, and learner-centered approaches. Seeing this as an opportunity, a team consisting of an educational developer, discipline-specific teachers, and language teachers decided to collaborate on the use of automatic speech recognition (ASR) to offer prompt feedback and interactive oral practice to support self-paced learning. It is, however, not easy to develop a complete ASR algorithm with reliable performance, which calls into question the objectivity of grading by AI. The ongoing development of AI technology must overcome many obstacles to resolve the existing technical problems in oral English assessment.
3.2. Rationale for the Study
There have been studies on the learning of English or English presentations by students of English as an additional language (EAL). However, research on the challenges and opportunities in the construction of an online platform to address the needs of EAL students specifically in learning English presentation is scarce. Seeing students struggle with English presentations and drawing on the literature on the potential of using AI in language evaluation, we formed a team of academics from an engineering department, an English language center, and an educational development center in two universities to explore the development of a platform for EAL university students to learn and practice English presentations. The platform would provide learning units focused on the skills required for delivering a strong presentation and allow AI evaluation of oral presentations. The team members from the engineering department led the technological development of the platform; those from the English language center identified the oral presentation training needs of EAL students and tailored the platform with learning units that targeted problems specific to those needs; and the team member from the educational development center advised on the educational needs of HE students and appropriate teaching methods.
3.3. Research Questions
To understand whether our AI-assisted learning platform is useful to students and to inform future development, we posed the following research questions:
- (1) Which quantifiers could be included in an AI-assisted presentation training platform?
- (2) What are the challenges and opportunities regarding the development and use of an AI-assisted presentation training platform?
3.4. Research Participants
To answer the research questions, the study involved both engineering and non-engineering students, targeting engineering students from a polytechnic university and humanities students from a mainly liberal arts university. Undergraduate students with different first languages in different years of study were invited to participate in an online survey.
Table 1 provides a summary of their demographic composition. Consent was obtained from five students to participate in a follow-up focus group interview.
We also sent out a needs analysis survey that targeted the teachers of the invited students and received nine responses. All of them were non-language teachers and non-native English speakers.
3.5. Research Methods and Instruments
This study employed a mixed research approach, as the dataset from any single procedure would not fully answer the research questions [43,44]. Quantitative data came from surveys that were conducted before and after the team developed the AI-assisted presentation training platform. Numerical data were also obtained from AI and human scoring of student presentations, allowing direct comparisons. Qualitative data were derived from focus group interviews to deepen the understanding of students’ needs and preferences.
To investigate students’ genuine needs in acquiring presentation skills, we conducted a needs analysis via an online survey (using Microsoft Forms) consisting of a total of 35 questions in English. The survey for students comprised 14 questions to gauge their feelings about giving oral presentations, their oral presentation needs, and areas in which an AI-assisted presentation training platform could help. The teachers of these students also completed an online survey, which comprised 21 questions about their perceptions of students’ feelings about oral presentations, their grading experience, and areas in which an AI-assisted presentation trainer could help in assessing presentations.
In addition, focus group interviews with students were designed to follow up on the survey results and to deepen understanding of students’ views and responses. All survey respondents were invited to join the interviews, which were transcribed verbatim.
3.6. Research Procedure
The team sent mass invitation emails to students and teachers of engineering and humanities disciplines about participation in the online baseline survey. Participation in the surveys and interviews was voluntary. Respondents were given a timeframe to submit their responses online.
Based on the results of the needs analysis, the team designed an online AI-assisted presentation training platform that features learning units catering to the needs of EAL speakers. This platform also features AI assessment of key components of oral presentations, such as pronunciation accuracy, fluency, vocal fillers, and facial emotions.
After a quick pilot build of the AI-assisted platform, the team conducted a beta test, using a Microsoft Forms survey with students, to test the reliability of the system. This pilot build of the platform offered three tools for testing: pronunciation, facial expression, and vocal filler detection. Participants began by completing a 5-min survey to test their baseline understanding of oral presentations. They were then required to study the learning units and submit two video-recorded presentations for AI assessment. Finally, students were asked to complete a 10-min survey to provide their feedback on using our platform. The beta test took as little as 45 min, or longer when respondents chose to study the provided content in depth.
To compare AI and human scoring, the team hired an English teaching professional—a teacher who is not currently teaching at the two universities involved in the study but has over 10 years of experience in English teaching, curriculum development, course delivery, and assessment of EAL courses at university level—to rate students’ oral presentations that had been submitted on our platform. The team then compared the manually graded results with the AI-graded results to discern differences in human and AI scores, shed light on the strengths and weaknesses of AI evaluation, and identify areas where AI would need to evolve to streamline the assessment process and enhance its precision.
4. Results and Discussion
4.1. Baseline Study on Needs Analysis
The baseline needs analysis aimed to determine how students and teachers perceive oral presentations as an assessment form, as well as to identify the training needs of students and the grading needs of teachers. We received survey responses from 104 students and nine teachers and conducted focus group interviews with five students who indicated on the needs analysis survey that they were willing to be interviewed.
From the nine responses from non-language teachers, we noted that most teachers would provide feedback on content (75%) and delivery (87.5%) (see Figure 1). Given that 88% of the non-language teachers grade 20 to over 101 presentations per semester (see Figure 2) and 50% review a video presentation more than once, there appears to be a heavy grading burden on teachers, suggesting that an AI-assisted system with automated, customizable grading assistance would be helpful.
When asked about their preferred AI features, respondents indicated that they would like to include automation in the grading of fluency (78%), accuracy (78%), and eye contact (67%) (see Figure 3). This shows that teachers can potentially use an AI grading platform to help with non-content grading. Teacher respondents also noted a strong need for additional training in students’ delivery skills (78%), structure and organization (67%), and referencing (33%) (see Figure 4).
Student respondents came from two broad categories, engineering and humanities. We noted a marked difference between the two broad disciplines, with engineering students disliking oral presentations more than their humanities peers—50% of humanities students responded they liked to do oral presentations, compared with only 26% of engineering students. We noted from some neutral comments that students were well aware of the pain points of oral presentations as an assessment. Student responses included, “oral presentations can showcase public speaking skills and charisma, so there is a higher chance of getting better scores”, and “it can be stressful for the presenter who may focus more on presenting techniques than professional skills.” They noted that those with stronger presentation techniques have a higher chance of scoring better and that nervousness, weakness in English speaking skills, and poor eye contact are undesirable for this kind of assessment. In fact, 24% of engineering students and 8% of humanities students had had poor scoring experiences in the past for presentation tasks. Respondents also reported they were uncomfortable with public speaking and had accent anxiety. Similar to teachers, they considered additional training in delivery skills (57%) more important than structure and organization (48%) and referencing (30%). Their top three most wanted AI-assisted training features were “filler alerts”, “silence checker”, and “eye contact counter”.
The findings from this baseline needs analysis helped us gauge students’ and teachers’ preferences for an AI-assisted presentation training and grading platform, as well as which features to prioritize in its development. Prioritization is particularly important for our small development team, as the AI features currently available on the market are not tailored for oral presentation training. Customizing these features required the team to analyze collected samples, compare perceived accuracy against machine performance, and incorporate elements for editable human feedback where the automation falls short. This is a very resource-intensive process that may not be viable without prioritization. We learned from the baseline study that presentation delivery is an area where training is a top priority and therefore tried to map existing, well-trained AI features (covering accuracy, fluency, vocal fillers, and facial expressions) to support self-paced learning with instant feedback and AI-assisted grading.
4.2. Development of an AI-Assisted Training Platform
Based on the results of the needs analysis, we designed an online AI-assisted presentation training platform, which was developed as a web application so that users could access the platform using any browser. It was designed to be an all-in-one platform for students to learn practical tips about oral presentations and rehearse with AI tools. The platform contains two modules: “Learning Units” and “Course and Assignment”. The “Learning Units” module provides presentation tips to help students learn how to deliver a good presentation. The “Course and Assignment” module allows students to submit their presentation assignments to the platform for AI evaluation and teacher grading. The hierarchical structure of the platform is shown in Figure 5.
For the “Learning Units” module, our language team designed customized presentation tips for our university students, covering topics such as content and structure, delivery, and pronunciation. Each learning unit includes a description and different sections. At the end of each unit, there is an AI activity in which students submit a presentation to practice what they have learned, which is then evaluated by AI tools for elements such as facial expression, vocal fillers, and pronunciation. An example of a learning unit is shown in Figure 6.
In the “Course and Assignment” module, students can submit a presentation under an assigned course. They can access learning units to receive presentation tips and submit presentation assignments to their assigned course(s) with multiple attempts allowed. After submission, they can check their attempt history with AI results and view the assessment results with the teacher’s feedback.
Figure 7 shows students’ submission of presentation assignments, and Figure 8 shows the assessment results with the teacher’s feedback.
To evaluate the presentation, the platform uses mature AI web services that are trained by machine learning and readily available on the market. The AI tools adopted cover facial expression, vocal filler detection, and pronunciation assessment. For facial expressions, the adopted AI tool can detect happiness, sadness, neutrality, anger, contempt, disgust, surprise, and fear. Figure 9 shows the results from a facial expression analysis.
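To illustrate how per-frame analysis can be aggregated into an emotion profile for a whole presentation, the Python sketch below samples frames from a submitted video and tallies the detected emotions into percentages. This is a minimal sketch, not the platform’s actual implementation; detect_emotion is a hypothetical stand-in for the external facial expression service.

```python
from collections import Counter

import cv2  # OpenCV, used here only to extract frames from a video file

EMOTIONS = ["anger", "contempt", "disgust", "fear",
            "happiness", "neutral", "sadness", "surprise"]

def detect_emotion(frame) -> str:
    """Hypothetical stand-in for the external facial expression web service."""
    return "neutral"

def emotion_distribution(video_path: str, every_n: int = 30) -> dict:
    """Sample every n-th frame and tally detected emotions as percentages."""
    cap = cv2.VideoCapture(video_path)
    counts, index = Counter(), 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:
            counts[detect_emotion(frame)] += 1
        index += 1
    cap.release()
    total = sum(counts.values()) or 1
    return {e: 100 * counts[e] / total for e in EMOTIONS}
```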
Vocal filler detection identifies the use of filler words in speech. Our language team supplied a list of customized filler words that are commonly used by local university students; this list is fed into our database. We then use an AI tool to detect the appearance of these words in a speech. The result displays the frequency of each filler word, as shown in Figure 10.
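The counting step itself can be illustrated with a short Python sketch. The filler list and count_fillers function below are illustrative stand-ins; the platform’s actual list is curated by our language team and stored in its database.

```python
import re
from collections import Counter

# Hypothetical filler list for illustration only; the platform uses
# a customized list supplied by the language team.
FILLER_WORDS = ["like", "you know", "actually", "basically", "so", "okay"]

def count_fillers(transcript: str) -> Counter:
    """Count occurrences of each listed filler word in an ASR transcript."""
    text = transcript.lower()
    counts = Counter()
    for filler in FILLER_WORDS:
        # \b word boundaries stop "so" from matching inside "also".
        pattern = r"\b" + re.escape(filler) + r"\b"
        n = len(re.findall(pattern, text))
        if n:
            counts[filler] = n
    return counts

print(count_fillers("So, basically the result is, you know, basically correct."))
# Counter({'basically': 2, 'you know': 1, 'so': 1})
```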
For pronunciation assessment, we use an AI tool to determine accuracy and fluency. Pronunciation accuracy is evaluated based on phonemes, while fluency is evaluated based on silent breaks between words. The overall score is the average of these two scores. The result displays the accuracy score, the fluency score, the overall score, and the AI-generated script, as shown in Figure 11.
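As a simplified illustration of this scoring scheme, the sketch below derives a fluency score from silent breaks between word timestamps and averages it with a phoneme-based accuracy score. The break threshold and penalty values are illustrative assumptions, not the AI service’s actual parameters.

```python
def fluency_from_breaks(word_timestamps, max_break=0.5, penalty=10.0):
    """Score fluency (0-100) by penalizing silent breaks between words.

    word_timestamps: list of (start_sec, end_sec) pairs from the ASR output.
    Breaks longer than max_break seconds are treated as disfluent pauses.
    """
    score = 100.0
    for (_, prev_end), (next_start, _) in zip(word_timestamps, word_timestamps[1:]):
        if next_start - prev_end > max_break:
            score -= penalty
    return max(score, 0.0)

def overall_score(accuracy, fluency):
    """The overall score is the average of the accuracy and fluency scores."""
    return (accuracy + fluency) / 2

words = [(0.0, 0.4), (0.5, 0.9), (2.1, 2.5), (2.6, 3.0)]  # one long pause
fluency = fluency_from_breaks(words)                       # 90.0
print(overall_score(accuracy=85.0, fluency=fluency))       # 87.5
```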
Numerous challenges were encountered in developing the platform. Since this is a collaborative project across two universities, we had to overcome the challenge of cross-university account logins. To allow students to use their university email accounts to log in, and to protect their privacy by not storing their passwords in the system, we integrated an authentication service offered by an external provider. Another problem concerned the AI tools. We used existing web services that had been extensively trained by machine learning; however, most existing AI services could not precisely fulfill the needs of our approach to presentation analysis. We therefore developed customized algorithms on top of the existing AI services to create AI tools specific to our platform. In addition, some AI web services take only images as input, so we had to pre-process the videos to extract useful assets for AI analysis. To address long waiting times caused by video pre-processing and AI analysis, unclear processing status, and data loss due to accidental termination of the AI analysis process, we created an AI job scheduling subsystem. Each video submission is now generated as an individual task in the system and is automatically stacked in a queue for background processing. This subsystem allows multiple students to submit assignments simultaneously without congestion.
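A minimal sketch of such a job scheduling subsystem, using Python’s standard queue and threading modules, is shown below. The analyze_video function is a placeholder for our pre-processing and AI service calls; the production subsystem is more elaborate.

```python
import queue
import threading

job_queue: "queue.Queue[dict]" = queue.Queue()

def analyze_video(job: dict) -> None:
    # Placeholder for video pre-processing (e.g., frame extraction)
    # and calls to the external AI web services.
    print(f"Processing submission {job['submission_id']}...")

def worker() -> None:
    """Consume queued submissions one at a time in the background."""
    while True:
        job = job_queue.get()          # blocks until a job is available
        try:
            analyze_video(job)
        finally:
            job_queue.task_done()      # mark done even if analysis fails

# A single daemon worker processes jobs sequentially, so simultaneous
# submissions are stacked in the queue rather than congesting the AI services.
threading.Thread(target=worker, daemon=True).start()

for sid in ("S001", "S002", "S003"):   # simulate simultaneous submissions
    job_queue.put({"submission_id": sid})

job_queue.join()                        # wait until all queued jobs finish
```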
4.3. Beta Test of the Platform
Following the pilot build of the platform, we conducted a beta test to examine the reliability of the system. Students were invited to trial the pilot platform, and twenty-four joined the beta test. User responses were very encouraging, as the vast majority found the learning units on content and structure (91.7%), delivery (87.5%), and pronunciation (87.5%) helpful for improving their oral presentation skills. Many students also found the AI tools for pronunciation (79.2%), facial expression (78.3%), and delivery (87.6%) helpful. The machine-generated scores, presented in the form of a scorecard, sparked particular interest in fine-tuning specific skills, with students suggesting that the system could list not only “mispronounced words and the corresponding correct sound with IPA” but also “the reasons for their current score and how they can improve”.
Linking these responses to some of the needs identified from the baseline study, we see the potential of both the learning units and the AI tools as useful self-directed learning tools for students. As no humans are involved in the AI scoring, and since students can test their performance as many times as desired without tiring a human assessor, they may find it less uncomfortable having a machine help them polish the skills that they might otherwise be too shy to practice. AI scoring also guarantees that assessment is strictly based on quantifiable parameters, thus minimizing any bias perceived by students, an issue that was revealed as a concern in the needs analysis.
Respondents also indicated that they would welcome more sophisticated and “clever” functions, with some suggesting additional tools for tracking eye contact, body language, and speech pace. The valuable feedback and suggestions illustrate a strong interest in AI and students’ faith in more impartial machine assessment.
4.4. Comparing AI Scoring with Human Scoring
To further ascertain the reliability of AI grading and identify the areas of the platform that may need improvement, the team invited an external rater, a highly experienced English language teacher who does not teach at either of the universities that developed the platform, to participate in a comparison of the autogenerated AI and human grading. Thirty-six video presentation samples from an engineering class were selected by the subject teacher, including high, medium, and low performances. As a reliability test of the grading component of the pilot platform, these videos were first scored by AI and then coded for anonymity before being graded by the external rater. The rater was tasked with providing qualitative feedback and/or a numerical score for the following metrics: fluency, accuracy, perceived facial expressions (emotions), and vocal fillers. The grading was done according to the rubrics in Appendix A.
4.4.1. Fluency
The autogenerated fluency scores were compared with the human rater’s scores, and a correlation coefficient of 0.136 was obtained. According to Microsoft, this value falls into the category of low association, which means the autogenerated scores are not significantly aligned with human ratings [45]. It is also noted that the human fluency scores, which span a 46-point range, are much broader than the AI range of 13.7 points. This weak correlation may be attributed to the use of “silent breaks between words” as the main measurement in the autogenerated fluency score. As illustrated in the rater’s qualitative comments on the samples with the biggest (51.4) and smallest (1.3) scoring differences, the human perception of poor fluency can stem from inefficient use of word linking, chunking, sentence stress, and rhythm, none of which can be measured by silent breaks alone (see Table 2).
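For readers wishing to reproduce this kind of comparison, the Pearson correlation coefficient and score ranges can be computed as in the sketch below; the scores shown are made-up placeholders, not the study’s data.

```python
import numpy as np

# Illustrative placeholder scores; the study's actual data are not reproduced here.
ai_scores    = np.array([78.2, 80.1, 75.6, 82.4, 79.0, 77.3])
human_scores = np.array([55.0, 80.0, 40.0, 70.0, 85.0, 60.0])

r = np.corrcoef(ai_scores, human_scores)[0, 1]  # Pearson correlation coefficient
print(f"correlation: {r:.3f}")
print(f"AI range: {np.ptp(ai_scores):.1f}, human range: {np.ptp(human_scores):.1f}")
```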
In assessing fluent speakers, however, the silent break measurement aligns better with the human rater. The human rater’s qualitative comments on video samples that score well on both sides include descriptors such as “very little hesitation”, “some stumbling but otherwise fluent throughout”, and “fluent with minimal distracting breaks.” Therefore, we contend that machine-generated feedback using the frequency of silent breaks as the main indicator of fluency may only be reliable for fluent speakers. The same mechanism would fail to screen out poor performers, who may confuse the tool with less obvious tricks between words; for example, the autogenerated fluency score is unaffected even when meaningless utterances are made. As a result, the score may fail to reflect poor fluency, making it difficult to provide actionable feedback for self-directed language learning.
4.4.2. Accuracy
The autogenerated accuracy score, which is based on phonemic closeness to a native speaker’s pronunciation, is more aligned with the human rater, as indicated by a correlation coefficient of 0.450. This is considered a medium-strength correlation, meaning that while autogenerated scores are generally aligned with human perception, there are differences, and subjective human ratings may diverge from AI-generated results. The qualitative comments on the samples with the biggest (36.8) and smallest (0.7) scoring differences are listed in Table 3.
Similar to our observations of fluency, the ranges of human (50) and AI (11.5) accuracy scores exhibited a marked difference, with the former more than four times broader than the latter. We believe this is a result of using phoneme-level accuracy as the basis for scoring. As noted in the human rater’s qualitative comments, human perception and cognition are highly sensitive to inaccurate grammar and mispronunciation. Incorrect stress in a word such as “produce” (noun: /ˈprɒdʒ.uːs/, verb /prəˈdʒuːs/), for example, was picked up by humans but not by the AI tool. In fact, 15 out of 36 video samples were scored more than 20 points higher by the AI tool. Based on these observations, we propose that an AI accuracy score based on phonemes can only be useful when it is combined with other metrics and perhaps requires a different label in a self-directed language learning tool.
4.4.3. Emotion
The AI emotion scoring tool, which focuses on facial expression, has several clear advantages: it does not become fatigued from continuously looking at students’ faces; it can perform under poor lighting conditions and even when the speaker window is small; it is more impartial than any human rater; and it can provide many more details on eight different emotions, namely, anger, contempt, disgust, fear, happiness, neutrality, sadness, and surprise. The human rater’s scores focused primarily on happy and neutral emotions, despite all eight options being provided on the mark sheet. That said, unlike the AI scoring tool, the rater provided comments beyond facial expressions, such as “engaging/distractive hand gestures”, “display of interest”, “uplifting/happy voice”, “loud and clear sound”, “enthusiastic”, “stilted”, “self-absorbed and unaware of an audience”, “energetic”, and “emotionless.” Self-directed learners may find these qualitative comments more useful than a percentage value for each of the eight facial expressions.
4.4.4. Vocal Fillers
As our sample videos were collected from non-native English speakers, and since vocal fillers are known to be a prominent problem for many EAL learners during oral presentations, an AI tool was developed to identify fillers from a customized list. While the AI tool can count most of the listed filler words, it does not identify non-word fillers such as “hmm” and “err”. We consider this a largely effective function, as it frees a human grader from the tedious task of counting unwanted words in a presentation and offers concrete feedback to students who may be unaware of their overuse of a particular filler. On the other hand, we need to fine-tune the current tool so that it can identify and mark non-word utterances for students’ reference. An additional list of qualitative feedback could also help students understand the importance of reducing fillers and learn techniques for that purpose.
4.4.5. Overall Observations from the Comparison between AI and Human Scoring
In terms of reliability, the AI vocal filler identification function is the most useful, as it can automatically report repeated utterances, which are usually undesirable. Students can easily eliminate them as they continue to practice with the platform. The AI-generated accuracy score, which has a medium-strength correlation with the human score, could become more useful if it were designed to show exactly which utterance is considered inaccurate.
The AI fluency score and the emotion identifier are the weaker performers among the tested features, so their use in the platform must be further investigated and fine-tuned. For example, the current AI-generated fluency score could be integrated with another metric, such as words per minute, so that platform users can act on the given scores in a more meaningful way.
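One possible form of such an integration, offered only as an illustrative assumption with arbitrary weights and pace thresholds rather than a committed design, is a composite score that blends the existing silence-break score with a words-per-minute component:

```python
def wpm_score(word_count: int, duration_sec: float,
              ideal_low: float = 110, ideal_high: float = 150) -> float:
    """Score 100 inside a comfortable pace band, tapering outside it."""
    wpm = word_count / (duration_sec / 60)
    if ideal_low <= wpm <= ideal_high:
        return 100.0
    gap = min(abs(wpm - ideal_low), abs(wpm - ideal_high))
    return max(100.0 - gap, 0.0)

def composite_fluency(silence_break_score: float, word_count: int,
                      duration_sec: float, weight: float = 0.5) -> float:
    """Blend the current silence-break score with a pace (WPM) component."""
    return (weight * silence_break_score
            + (1 - weight) * wpm_score(word_count, duration_sec))

print(composite_fluency(90.0, word_count=260, duration_sec=120))  # 95.0
```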
Similar inter-rater comparisons must be carried out for new platform features in the future, firstly as a reliability test, and secondly to explore opportunities to adjust and consolidate the outputs into more meaningful feedback to support users’ self-directed learning.
5. Conclusions
Our AI-assisted presentation training platform was created based on three observations: (i) EAL students need more practice giving oral presentations in English; (ii) discipline faculty (such as engineering teachers) often teach large classes with oral presentation assignments and have time to focus on the content of students’ oral presentations but not on their communicative aspects; and (iii) technological advancements have raised the possibility of AI scoring as a means of giving students feedback about certain aspects of their oral presentations. The speaking practice needs of EAL students were corroborated via two surveys of students and teachers, the results of which informed the design of our AI training platform. The first build of the platform revealed numerous challenges that compelled us to evaluate the preliminary platform via two channels: a beta test with 24 students and a comparison of AI and human scoring of the presentation performances of 36 students. While the former yielded highly encouraging results, with students finding the AI training useful in numerous aspects of communication (such as structure, delivery, and pronunciation), the latter showed weak and medium associations between AI and human raters in terms of fluency and accuracy, respectively, with the AI tool having the distinct advantage of continuously capturing facial emotions (even in low-light situations) and counting vocal fillers. The main differences between AI and human ratings stem from the (in)ability of AI to comprehend nuances. Not all abstract concepts (e.g., fluency) can be measured through AI proxies (e.g., silent breaks between words, as pausing can be a purposeful act for dramatic effect rather than an indication of hesitancy or disfluency). In contrast, the human scorer cannot match the ability of AI to register large numbers of occurrences but can detect the multiple facets of communication and their interplay.
As presented in Section 4 above, AI evaluation of oral presentations has limitations at its current stage of development and cannot be exclusively relied upon as an assessment tool. Communication is multifaceted and immensely complex; to become more aligned with human rating and to supplement human raters, the training platform should include more AI tools to strengthen the rating of the important concepts of fluency, accuracy, and facial expressions, and the possibility of expanding the scope of AI evaluation to other key aspects of presentations, such as eye contact and helpful signposting, should be examined. Future work is also warranted in fine-tuning the definitions of the criteria for assessing presentations to improve the reliability of AI evaluation. The potential of artificial and human intelligence working in tandem should continue to be explored in the ongoing search for solutions that can enhance learning experiences and improve learning outcomes.